PRIMERS FOR IDENTIFYING, TYPING OR
CLASSIFYING NUCLEIC ACIDS

DNA-sequence analysis is rapidly becoming a standard tool
in modern molecular biology research. Examples of applications include
sequencing of unknown DNA sequences, identifying novel genes in
stretches of sequenced DNA, predicting protein sequence and structure
from DNA sequence alone, and identifying known gene variations
(sometimes called "typing a gene").

Typing of a gene can be crucial in some applications. For
instance, organ donation requires that the "immunological signature" of the
donor matches that of the recipient. This "signature" is mediated by the
Human Leucocyte Antigen (HLA) complexes (also known as the Major
Histocompatibility Complex, MHC) on the cell surface, and the
corresponding genes are among the most varied in the human genome.
Considering the importance of organ donation, the shortage of organ
donors and the fact that an organ cannot be stored for long,
rapid and accurate typing of the HLA genes is required in order
to make the most of the organs available for transplantation.

Another application where rapid and accurate identification
of a gene is desired is the identification of unknown bacteria. Rapid
identification of the bacteria causing a patient's illness makes it
possible to administer the correct medication early in the treatment of the
disease, thus reducing the discomfort for the patient. Since every
self-replicating organism studied so far uses ribosomes when translating mRNA
to proteins, analysis of one of the genes coding for the ribosome, for
instance the 16S rRNA in the case of prokaryotes, could be used to identify
the organism in question.

There are several ways in which a gene can be identified,
the conceptually easiest being to sequence the entire gene and then
look at the result. The main drawback is that this approach is
time-consuming and not easily scaled up using conventional methodology. A
new method, Arrayed Primer Extension (APEX), lacks this drawback.
APEX works by immobilising a large number of primers on a solid surface,
thus creating a DNA chip. These primers are constructed to overlap
consecutively over the entire gene of interest, so that every
base in the gene has a primer to its 5'-end. Fluorescently
labelled dideoxynucleotides are then added, and each primer is extended by one
nucleotide using the sample DNA as template. It is thus easy to check
which nucleotide was incorporated, which in turn gives the entire
sequence of the sample DNA.

Since some genes, like the HLA and 16S rRNA genes, have a large
number of known variations, a prohibitively large number of primers has to
be created in order to probe for all possible combinations of variant
positions in the gene. For example, APEX resequencing would need more
than 16,000 primers if all DQB alleles were to be sequenced from a 500 bp
long PCR fragment. If all pairwise combinations of DQB alleles had to be
covered, as would be the case for the heterozygotes that make up most
individuals, the number of primers might be even higher.

But this might not be necessary, if some variations always or 
never occur together. This needs to be studied though, and a way found to 
determine the least number of primers (and what their sequences are) 
required for unambiguously identifying those genes. 

An object of this invention is to find and implement an efficient
algorithm capable of doing just that. The algorithm should preferably also
take into account the melting points of the primers, so that the extension
reaction can take place under optimal conditions for all of the primers on
the chip. It should also minimise the number of "self-extended" primers, i.e.
primers that can extend themselves without any sample DNA. This
algorithm is then to be tested and evaluated on the HLA and 16S rRNA
genes. HLA is chosen partly because of the importance of rapid typing of
these genes, which means that there are many other methods with
which APEX can be compared, and partly because the HLA genes are
"easy" to work with, since they rarely contain insertions or deletions.
Variations of that kind could potentially create problems
when designing primers for APEX. The 16S rRNA gene, on the other hand,
contains insertions and deletions and can thus be used to see whether the
algorithm can handle such variations.

The invention provides a method of identifying a set of
extendible primers for use in the identification, typing or classification of a
nucleic acid of known sequence having known polymorphisms wherein:
i) all possible nucleotide sequences of a chosen length of the
nucleic acid are identified and their corresponding extendible primers,
ii) at least one extendible primer is removed from the set
wherein the at least one primer removed identifies a segment of the nucleic
acid identified by at least one other primer.

Preferably the method includes between step i) and ii):
ia) potential extensions for each primer are identified with
respect to each nucleotide sequence,
ib) for each extendible primer the identified potential extensions
are compared to determine which pairs of sequences can be discriminated
by the primer.

Preferably a matrix of primers and pairs of primer extensions 
is prepared in binary form and is subjected to analysis by a set covering 
problem (SCP) algorithm as described in more detail below. 

The invention also includes a set of extendible primers, for
use in the identification, typing or classification of a nucleic acid of known
sequence having known polymorphisms, identified by the method as
defined. Preferably the primers are attached by their 5'-ends to a surface of a
support on which they are presented in the form of an array.

In another aspect, the invention provides a set of extendible
primers, for use in the identification, typing or classification of a human
leucocyte antigen (HLA) gene as indicated, the set comprising about the
number of primers indicated and being capable of distinguishing about the
number of alleles indicated:

            HLA gene    Number of
                        Alleles    Primers

Class I     HLA-A       91         172
            HLA-B       200        <1000
            HLA-C       47         94
Class II    DPA-1       11         26
            DPB-1       74         130
            DQA-1       17         130
            DQB-1       34         84
            DRB-1       192        <1000
            DRB345      35         94

In another aspect, the invention provides a set of extendible
primers, for use in the identification, typing or classification of 16S rRNA,
wherein the set comprises about 210 primers and is capable of
distinguishing at least about 1207 different sequences.

In these aspects of the invention, the approximate number of
primers is indicated. As indicated below, it may be possible by the use of
the algorithms exemplified or other algorithms to generate slightly smaller
sets of primers capable of distinguishing the number of alleles or
sequences indicated, and these sets are envisaged according to the
invention. Of course, other primers may be present in addition to those
indicated as essential, and may be useful for checking purposes. The
number of alleles or sequences indicated represents the approximate
known number of polymorphisms or different sequences, and these will
surely increase with time.

In another aspect the invention provides a method of
identification, typing or classification of a nucleic acid of known sequence
having known polymorphisms, by the use of the set of extendible primers
as defined, which method comprises applying the nucleic acid or fragments
thereof to the set of extendible primers under hybridisation conditions and
effecting template-directed chain extension of extendible primers that have
formed hybrids. Preferably template-directed chain extension is effected
using four different fluorescently labelled chain-terminating nucleotide
analogues, and results are analysed by an imaging system such as total
internal reflection fluorescence (TIRF) or scanning confocal microscopy.
The various steps of the method may be performed as described in the
literature for the known APEX technique.

In another aspect the invention provides a kit for use in the
identification, typing or characterisation of a nucleic acid of known
sequence having known polymorphisms, comprising the set of extendible
primers as defined.

In another aspect the invention provides an array of sets of 
extendible primers as defined, for the simultaneous identification, typing or 
classification of two or more different HLA genes. 

With the present invention it has been realised that where a
number of different alleles are to be identified, the total number of primers
required to distinguish each of the alleles could be reduced as some
primers would be common to all of the alleles, for example. Thus, with the
present invention complete sets of primers for identification of each allele
are identified and then the total number of primers in the combined sets is
reduced using predetermined rules.

Furthermore the present invention is based on the premise
that, as the primers are used to identify the presence or absence of a
particular nucleotide sequence in any allele, the specific nucleotide that
extends any particular primer is of less relevance than simply whether the
primer has been extended. Thus, the problem of reducing the overall
number of primers is greatly simplified, rendering the problem one suitable
for treatment as a Set Covering Problem (SCP).

Embodiments of the present invention will now be described
by way of example with reference to the accompanying drawings and
examples, in which:

Figure 1 is a diagram of a signal matrix in accordance with
the present invention;

Figure 2 is a diagram of the corresponding binary matrix for
the signal matrix of Figure 1;

Figure 3 is a flow diagram of the steps for reducing the primer
set in accordance with the present invention.

The following is an explanation to assist in an understanding
of the principles underlying the manner in which the number of primers
used in the identification of a plurality of sequences may be reduced.

Theoretically the number of primers required to identify k
sequences grows as O(k·l), where l is the length of the sequences, as each
sequence requires l primers. However, the less the sequences differ from
one another, the fewer primers are required, as many of the primers
required for identification of a first sequence may also be of use in
identification of another sequence. This effect becomes more pronounced
the greater the number of sequences to be identified and the greater the
similarities.

Considering an initial set of n primers required in the
identification of k sequences, a signal matrix of k × n can be constructed.
Each element in the matrix represents the signal, if any, that is generated
by a particular primer with respect to a particular sequence. The signal will
either be one of the four nucleotides 'A', 'C', 'G' or 'T', or no signal.
Figure 1 is an example of such a signal matrix where, for example, the
signal generated by primer 2 with respect to sequence 3 is 'T'.

The signal matrix is then converted into a binary matrix that
represents whether the signals for any particular primer differ with respect
to different sequences. Thus, again with respect to primer 2, the same
signal 'G' is generated for both sequences 1 and 2 but a different signal 'T'
is generated with respect to sequence 3. The binary matrix is constructed
by considering each column (each primer) of the signal matrix and
comparing each signal in that column in turn. Thus, as shown in Figure 2,
the first row of the matrix represents a comparison of the signals for the first
and second sequences, the second row represents a comparison of the
signals for the first and third sequences and the third row represents a
comparison of the signals for the second and third sequences. Binary '0'
represents a comparison revealing the same signal and binary '1'
represents a comparison revealing different signals. In the case of primer
2, as mentioned earlier, the signals for the first and second sequences are
the same ('0') whereas the signals for the first and third sequences are
different ('1'). This conversion produces an m × n matrix where m = k(k-1)/2.
Hence, for large numbers of sequences, 2m grows approximately as the
square of the number of sequences. Figure 2 shows the binary matrix for
the signal matrix of Figure 1.
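
The conversion can be sketched in Python as follows. This is only an illustration under stated assumptions, not the implementation used for the invention: the function name `binary_matrix` and the example values are hypothetical. Only the primer 2 column (G, G, T) is taken from the text; the other columns are invented and are not the values of Figure 1.

```python
from itertools import combinations


def binary_matrix(signal):
    """Convert a k x n signal matrix into an m x n binary matrix.

    signal[s][p] is the signal of primer p for sequence s.
    Each output row corresponds to one pair of sequences, in the order
    (1,2), (1,3), ..., (2,3), ...; an entry is 1 when the primer gives
    different signals for the two sequences of the pair, otherwise 0.
    """
    k = len(signal)
    n = len(signal[0])
    rows = []
    for s1, s2 in combinations(range(k), 2):
        rows.append([1 if signal[s1][p] != signal[s2][p] else 0
                     for p in range(n)])
    return rows


# Hypothetical example: 3 sequences (rows) and 3 primers (columns).
signal = [
    ["A", "G", "C"],  # sequence 1
    ["C", "G", "C"],  # sequence 2
    ["C", "T", "C"],  # sequence 3
]
binary = binary_matrix(signal)
for row in binary:
    print(row)
```

With k = 3 sequences the result has m = 3 rows, one per pair of sequences, and the second column reproduces the primer 2 comparison described in the text.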

As the primers are required to enable the differentiation of
sequences from one another, the reduction of the signal matrix to a binary
matrix, representing differences in the signals obtained for different
sequences, distils that element of information necessary to enable a
selection of the minimum number of primers necessary to identify the
individual sequences. From the binary matrix the least number of columns
is selected such that each row contains at least one non-zero element.
Thus, if one of the columns contained only '1's, only that one column would
be required. However, in the case of Figure 2, there is no single column
containing all '1's and so two columns must be selected, for example
primers 1 and 2. Primers 1 and 2 together enable each of sequences 1, 2
and 3 to be differentiated and so the remaining primers are redundant.
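
For a matrix this small the selection can be checked by exhaustive search. The sketch below uses a hypothetical 3 × 3 binary matrix (not the values of Figure 2), and the name `smallest_cover` is an assumption. Exhaustive search is only workable for very small matrices, which is why heuristics are needed for realistic problem sizes.

```python
from itertools import combinations


def smallest_cover(binary):
    """Return the first smallest tuple of column indices such that
    every row has a non-zero element in at least one chosen column.
    Returns None if some row cannot be covered at all."""
    n = len(binary[0])
    for size in range(1, n + 1):
        for cols in combinations(range(n), size):
            if all(any(row[c] for c in cols) for row in binary):
                return cols
    return None


# Hypothetical binary matrix: rows are sequence pairs, columns are primers.
binary = [
    [1, 0, 0],  # sequences 1 and 2
    [1, 1, 0],  # sequences 1 and 3
    [0, 1, 0],  # sequences 2 and 3
]
cover = smallest_cover(binary)
print(cover)  # column indices are 0-based, so (0, 1) means primers 1 and 2
```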

Where large numbers of sequences and primers are involved,
the binary matrix renders the data contained within that matrix suitable for
mathematical analysis. Once the selection of the reduced number of
primers has been made, though, it is the signal matrix that is required
during the use of the primers in the identification of the different sequences.
Thus, the signal matrix is used to 'decode' the results of any analysis using
the reduced number of primers.

In practice, large numbers of sequences and primers are
involved and the selection of a reduced set of primers cannot be performed
by simple inspection of the binary matrix. For large numbers of primers,
selection of a suitable reduced set of primers can be performed by treating
the selection as a Set Covering Problem (SCP). An SCP is an integer
optimisation problem and is well known in fields such as airline crew
scheduling, selecting manufacturing equipment and ingot mould selection
in steel production. In such large scale problems that cannot be solved
exactly (NP-hard), heuristics are used in order to generate a solution. As an
SCP is NP-hard, global algorithms and algorithms that identify local optima
are not very suitable on their own for a large scale SCP. They simply
require far too much computation, as they try to find a solution that can be
proven to be at least locally optimal. For this reason heuristic methods are
required instead. They do not claim to give even a locally optimal solution,
but are much faster.

Two known computational methods that have been found to 
be effective in identifying reduced sets of primers are the 'greedy' algorithm 
and Lagrangian relaxation algorithm. 

Greedy Algorithm

The simplest heuristic is the greedy algorithm, where
columns are added one at a time. The column to be added in each step is
chosen so as to cover as many uncovered rows as possible (a row is
covered if it has at least one non-zero element). In other words, if S^r is the
set of columns already included in the solution at iteration r, and R^r is the
set of rows with no non-zero elements at iteration r, column j^r is selected
according to:

$$j^r = \arg\min_{j \notin S^r} \frac{c_j}{P_j}, \qquad P_j = \sum_{i \in R^r} a_{ij}$$

Equation 1

where c_j is the cost of column j, a_ij is the element in row i and column j
of the binary matrix, and P_j is the number of uncovered rows that column j
would cover.

This continues until all rows are covered, or until no
columns remain that can cover any of the rows still uncovered. Instead of
minimising the term c_j/P_j, other terms can be used. Example terms are c_j,
c_j/log2(P_j) or c_j/(P_j)^2. Greedy algorithms of this type are described in "An
Efficient Heuristic for Large Set Covering Problems", Vasko, Wilson, Naval
Research Logistics Quarterly 1984, 31:163-171, the contents of which are
incorporated herein by reference. The difference is in how much emphasis
is placed on the cost of the column versus how many rows the column
covers. It has been shown, however, that this entire class of heuristics shares the
same worst case behaviour. If we denote the set of columns in the solution
as S and the solution value as Z, then the worst case behaviour can be
described as:



$$Z_{\mathrm{heuristic}} \le H(d)\, Z_{\mathrm{optimal}}$$

Equation 2



where

$$H(d) = \sum_{j=1}^{d} \frac{1}{j}, \qquad d = \max_{j} \sum_{i=1}^{m} a_{ij}$$

Equation 3

In other words, how much worse the heuristic solution is
compared to the optimal solution depends on the maximum number of
non-zero elements in the columns. The advantage is that this algorithm is
fast, even though its time complexity is O(m²n): there can be a maximum of
m columns in the solution, i.e. the maximum number of iterations is m, and for
each iteration the matrix is traversed once to find the next column to be
added. Altogether, the time required to solve the problem in
the worst case scenario will grow as the number of sequences to the power
of five (four due to the number of rows, and one due to the number of
columns). In the case of 16S rRNA (see later), where we have ~1000
sequences, the matrix will have ~500,000 rows. The number of primers
(columns) is in this case ~250,000.
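
The greedy rule of Equation 1 can be sketched as follows. This is a minimal illustration rather than the implementation evaluated in this document: the function name `greedy_cover`, the unit costs and the example matrix are assumptions, and ties are broken by taking the lowest column index.

```python
def greedy_cover(a, cost=None):
    """Greedy heuristic for the set covering problem.

    a[i][j] is 1 if column j covers row i, otherwise 0.
    In each iteration the column minimising cost[j] / P_j is added,
    where P_j is the number of still-uncovered rows that column j covers.
    Returns the chosen columns and the set of rows left uncovered.
    """
    m, n = len(a), len(a[0])
    cost = cost or [1.0] * n          # assumption: unit cost per primer
    uncovered = set(range(m))
    solution = []
    while uncovered:
        best, best_ratio = None, None
        for j in range(n):
            if j in solution:
                continue
            p = sum(a[i][j] for i in uncovered)
            if p == 0:                # column covers nothing new
                continue
            ratio = cost[j] / p
            if best is None or ratio < best_ratio:
                best, best_ratio = j, ratio
        if best is None:              # no column can cover the remaining rows
            break
        solution.append(best)
        uncovered -= {i for i in uncovered if a[i][best]}
    return solution, uncovered


# Hypothetical binary matrix (rows: sequence pairs, columns: primers).
a = [
    [1, 0, 0],
    [1, 1, 0],
    [0, 1, 0],
]
columns, left = greedy_cover(a)
print(columns, left)
```

The inner loop traverses the matrix once per iteration, which is where the O(m²n) worst case discussed above comes from.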

Lagrangian relaxation

More sophisticated methods exist, which use other kinds of
heuristics. The heuristics believed to generate the solutions closest to
optimal are Lagrangian relaxation heuristics, where in
each iteration the Lagrange multipliers are used to
calculate the Lagrangian cost for the columns. Such a Lagrangian
relaxation heuristic is described in "A Heuristic Method for the Set Covering
Problem", Caprara et al., Technical Report OR-95-8, Operations Research
Group, University of Bologna 1995, the content of which is incorporated
herein by reference. A near-optimal vector of these costs is then calculated
by a subgradient algorithm, before being used as input to a greedy
algorithm. This is repeated until no improvements in the solution can be
made.

In Lagrangian subgradient methods the Lagrangian of the
original problem is considered instead of the original problem. In this case,
the Lagrangian will be

$$L(u) = \min_{x} \sum_{j=1}^{n} c_j(u)\, x_j + \sum_{i=1}^{m} u_i$$

Equation 4

where u_i is the Lagrangian multiplier for row i. c_j(u) is the
Lagrangian cost associated with column j, and is defined by

$$c_j(u) = c_j - \sum_{i=1}^{m} a_{ij}\, u_i$$

Equation 5

An optimal solution to Equation 4 is given by

$$x_j(u) = \begin{cases} 0 & \text{if } c_j(u) > 0 \\ 1 & \text{if } c_j(u) < 0 \\ 0 \text{ or } 1 & \text{if } c_j(u) = 0 \end{cases}$$

Equation 6

L(u) can also be seen as an estimate of the lower bound for
the solution, i.e. the sum of the costs for the columns in the optimal solution
to the SCP will be ≥ L(u). The solution to the SCP can be found by finding
an optimal multiplier vector u instead, but this requires much
computation, especially for a large SCP. Near-optimal multiplier vectors
can, however, be found in a short time by using the subgradient vector s(u), defined
by

$$s_i(u) = 1 - \sum_{j=1}^{n} a_{ij}\, x_j(u), \qquad i = 1, \ldots, m$$

Equation 7

u can be refined iteratively by using for example

$$u_i^{k+1} = \max\left\{ u_i^{k} + \lambda\, \frac{UB - L(u^{k})}{\lVert s(u^{k}) \rVert^{2}}\, s_i(u^{k}),\ 0 \right\}, \qquad i = 1, \ldots, m$$

Equation 8

where λ > 0 is a step-size parameter and UB is an upper
bound on the value of the solution. The initial u^0 can be defined arbitrarily.
To solve the SCP, first a near-optimal multiplier vector u is found. This and
Equation 6 are then used as a basis to form a feasible solution. The upper
bound UB can then be updated to the value of this feasible solution (if it is
better than the previous best solution), and a new near-optimal multiplier
vector found, and so on until convergence is reached.
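
One subgradient iteration built from Equations 4 to 8 can be sketched as below. The sketch assumes each row needs to be covered once, takes x_j(u) = 0 when the Lagrangian cost is exactly zero, and uses hypothetical names (`lagrangian_step`) and example data (unit costs, a 3 × 3 matrix, UB = 3 from selecting every column, λ = 0.1). It is not the implementation evaluated in this document.

```python
def lagrangian_step(a, c, u, ub, lam):
    """Perform one subgradient update of the multiplier vector u.

    a[i][j]: binary matrix, c[j]: column costs, ub: upper bound UB,
    lam: step-size parameter. Returns (new_u, L(u)).
    """
    m, n = len(a), len(a[0])
    # Equation 5: Lagrangian cost of each column.
    cu = [c[j] - sum(a[i][j] * u[i] for i in range(m)) for j in range(n)]
    # Equation 6: select columns with negative Lagrangian cost.
    x = [1 if cj < 0 else 0 for cj in cu]
    # Equation 4: value of the Lagrangian, a lower bound on the SCP optimum.
    lower = sum(cu[j] * x[j] for j in range(n)) + sum(u)
    # Equation 7: subgradient, one component per row.
    s = [1 - sum(a[i][j] * x[j] for j in range(n)) for i in range(m)]
    norm2 = sum(si * si for si in s)
    if norm2 == 0:                    # zero subgradient: nothing to update
        return list(u), lower
    # Equation 8: step along the subgradient, keeping multipliers >= 0.
    step = lam * (ub - lower) / norm2
    new_u = [max(u[i] + step * s[i], 0.0) for i in range(m)]
    return new_u, lower


# Hypothetical example data.
a = [
    [1, 0, 0],
    [1, 1, 0],
    [0, 1, 0],
]
c = [1.0, 1.0, 1.0]
u = [0.0, 0.0, 0.0]
best_lower = float("-inf")
for _ in range(100):
    u, lower = lagrangian_step(a, c, u, ub=3.0, lam=0.1)
    best_lower = max(best_lower, lower)
print(best_lower)
```

For this example the smallest cover has two columns, so every L(u) produced by the loop stays at or below 2.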
Another alternative computational method that may be
employed to solve such an SCP is 'surrogate relaxation', in which in each
iteration a corresponding continuous problem is solved and made feasible
before a subgradient algorithm is applied. Alternatively, genetic algorithms
may be employed in which the 'genome' consists of n bits, one bit for each
of the columns.

It should also be borne in mind that, as the SCP operates on
the binary matrix which only represents differences in signals between
sequences for the same primer, a primer in the selected reduced set may
generate a negative signal rather than a positive signal (A, C, G or T). To
be sure that the sample does in fact contain a particular sequence it is
essential to ensure that for each sequence at least one primer generates a
positive signal. Furthermore, in practice redundancy is desirable as not all
reactions may occur as intended. Therefore, the least number of
positive signals as well as the least number of differences in the signal
pattern is preferably larger than one.

With reference to Figure 3, the following is a description of 
one method of selecting a reduced set of primers. 

Firstly, all possible primers are selected (10) using the
standard APEX procedure to produce a first set of primers. During this
selection a substring of the sequence to be analysed is used to construct
one primer, then the substring is displaced by one base and another primer
is constructed. This process is carried out from the start of the sequence
until the entire sequence has been covered. Both strands of DNA are used
and this is repeated for all sequences. The primers should be long enough
to be capable of discriminating between exact matches and mismatches
involving one or two nucleotide pairs. Conveniently, the primers are 13 bp
long as this has been found to be sufficient to ensure the reaction, or longer
to increase hybrid stability. However, to avoid steric hindrance on the chip
each primer may be 5'-tailed. In this example, twelve T's are added to the
5'-end of the primer so that the final length of the primers is 25 bp.
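
The tiling step can be sketched as follows. The function names and the example sequence are hypothetical, and the sketch assumes that primers are written 5' to 3' as substrings of each strand and that tiling stops one base before the end of the strand, so that a template base remains for the single-base extension.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the opposite strand, written 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def tile_primers(seq, length=13, tail="T" * 12):
    """Construct overlapping primers along both strands of seq.

    A window of `length` bases is displaced by one base at a time;
    each primer is given a 5' poly-T tail (12 T's plus a 13-mer
    gives a 25-mer). Duplicate primers are merged.
    """
    primers = set()
    for strand in (seq, reverse_complement(seq)):
        # Assumption: leave at least one base after the primer site.
        for start in range(len(strand) - length):
            primers.add(tail + strand[start:start + length])
    return sorted(primers)


# Hypothetical 20-base sequence.
primers = tile_primers("ATGCGTACCTGAAGCTTGCA")
for p in primers:
    print(p)
```

In practice this would be repeated for every known sequence of the gene and the resulting primers pooled.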

Next, all primers that are not suitable as primers are rejected
(12) and the rest are included in a primary primer set. Unsuitable primers
are those where the three bases at the 3'-end are complementary to any
substring of the primer. In some instances this can result in the primer
being extended with a neighbouring primer, and not the sample DNA, as
template, and for that reason such primers are considered unsuitable.
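
A literal reading of this rejection rule can be sketched as below. The sketch assumes that "complementary" means antiparallel (reverse) complementarity and that the test is applied to the primer sequence passed in, without deciding whether the poly-T tail should be included; both points, and the function names, are assumptions rather than details given in the text.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the opposite strand, written 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def may_self_extend(primer):
    """True if the three bases at the 3'-end could anneal to some
    substring of an identical neighbouring primer."""
    site = reverse_complement(primer[-3:])
    return site in primer


# Hypothetical 13-mers.
print(may_self_extend("ACGTTGCAGGCGC"))  # 3'-end CGC can pair with GCG in the primer
print(may_self_extend("ACCTGAAGCTTGA"))  # 3'-end TGA would need TCA, which is absent
```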

Also, any primers that would produce ambiguous signals are
identified and rejected (14). A primer produces an ambiguous signal where
it is not known which of the four bases is in the relevant position.

Each of the remaining primers in the primary primer set is
then compared to each sequence in turn to determine whether the primer is
extendible by each sequence and, if the primer is extendible, the base with
which it would be extended is determined. A signal matrix of the primers
with respect to each of the sequences is thus generated (16).

In order for a primer to be extended using the sample DNA as
template, the three bases at the 3'-end of the primer must hybridise to the
DNA. Otherwise the enzyme responsible for the extension will not be able
to add a nucleotide to the primer. In the rest of the primer (the poly-T tail
excluded), at most two mismatches are allowed; otherwise the primer-DNA
duplex is considered to be too unstable to be extended.

In ordinary PCR, all the bases must match in order for the
primer to be extended. But there the temperature is raised to the melting
point, Tm, of the primer in the extension step. In APEX, the reaction is
carried out at 45°C, which is around 10-20°C below the Tm of most primers.
This means that the primers will hybridise to the DNA despite a few
mismatches, which is why two mismatches are allowed here.
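
The extension rule can be sketched as follows. The sketch assumes the primer (without its poly-T tail) is compared against the strand of the same sense, which is equivalent to hybridising to the complementary strand, so that the incorporated base equals the next base on that strand. The function name and the example sequences are hypothetical.

```python
def extension_signals(primer, strand, max_mismatches=2):
    """Return the set of bases by which `primer` could be extended.

    The last three bases of the primer must match exactly; in the
    remaining bases at most `max_mismatches` mismatches are tolerated.
    An empty set means the primer is not extended (no signal).
    """
    n = len(primer)
    signals = set()
    for pos in range(len(strand) - n):
        site = strand[pos:pos + n]
        if site[-3:] != primer[-3:]:
            continue                  # 3'-end does not hybridise
        mismatches = sum(1 for p, s in zip(primer[:-3], site[:-3]) if p != s)
        if mismatches <= max_mismatches:
            signals.add(strand[pos + n])
    return signals


primer = "ACCTGAAGCTTGA"
print(extension_signals(primer, "GGACCTGAAGCTTGACTT"))  # exact match
print(extension_signals(primer, "GGTCCTGTAGCTTGACTT"))  # two internal mismatches
print(extension_signals(primer, "GGACCTGAAGCTTCACTT"))  # mismatch in the 3'-end
```

Returning a set rather than a single base allows for a primer that hybridises at more than one position.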

In some cases a primer could hybridise to a sequence in
more than one position, and sometimes a primer could hybridise to both
strands of one allele and give different signals. In those cases all the
different signals are combined to form one resulting signal (e.g. 'A' and
'C' together form 'M', which is the NC-IUB (NC-IUB, 1985) code for this
combination).
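
Combining several signals into one ambiguity code amounts to a table lookup, sketched below. The table holds the standard nucleotide ambiguity codes; the function name and the use of '-' for "no signal" are assumptions.

```python
# Nucleotide ambiguity codes, keyed by the sorted set of bases.
IUB_CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "AG": "R", "CT": "Y", "CG": "S", "AT": "W", "GT": "K", "AC": "M",
    "CGT": "B", "AGT": "D", "ACT": "H", "ACG": "V", "ACGT": "N",
}


def combine_signals(signals, no_signal="-"):
    """Map a collection of bases to a single ambiguity code.
    An empty collection yields the `no_signal` marker."""
    key = "".join(sorted(set(signals)))
    return IUB_CODES.get(key, no_signal)


print(combine_signals({"A", "C"}))
print(combine_signals({"G"}))
print(combine_signals(set()))
```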

For each column of the signal matrix the entries for each row
are compared against one another; in other words, for each primer the
signals produced by the primer for each sequence are compared against
each other. A binary matrix is thus generated (18) of the primers with
respect to the identity or difference of signals for pairs of sequences. The
binary matrix contains non-zero entries where the primer is able to
distinguish between a pair of sequences.

The number of pairs of sequences that each primer can
distinguish between is counted and a score is allocated to each primer
(20) in dependence on the total number of pairs of sequences counted.
Thus, the number of non-zero elements for each primer is counted.
Primers that are unable to distinguish between any pairs of sequences are
rejected (22) and the remaining primers are sorted (24) in order of their
score, with the primers with the higher scores at the beginning.

A core of primers is created next (26). The primer with the
highest score is selected. Where two primers with equal scores exist, the
number of positive signals is determined for each and the primer with the
greater number of positive signals is chosen. If both primers remain equal,
one is then selected arbitrarily over the other. After the main primer has
been selected, the first twenty (five times the desired redundancy, which is
four here) primers giving positive signals for each sequence in turn are
selected for the core. All remaining primers are rejected.

A greedy algorithm is then run (28) using the core set of
primers to identify the minimum number of primers necessary to distinguish
each sequence. As the greedy algorithm is run, primers are added one at
a time, with each primer being selected in turn in relation to the number of
uncovered rows it is capable of covering. When all rows are covered at
least four times, the reduced set of primers is checked for any sequences
that have fewer than four positive signals and extra primers are added as
necessary to meet this minimum requirement.

A redundancy check is then performed (30) to identify
whether any more primers can be removed. During the redundancy check
each primer is "tentatively" removed in turn to see whether the remaining
primers meet the minimum requirements.

If not, the next primer is tried. Otherwise the primer is
temporarily removed from the set, and the process continues with the next
primer in line. This process continues until no more primers can be
removed, in which case the last primer to be removed is added back to the
set, and the next primer in line is tentatively removed, and so on. This can be
viewed as a depth-first search of a tree where the nodes are combinations
of primers, and the number of primers in each node is one less than in a
node one level above. The root node thus contains all primers from the
greedy algorithm. It has p (the number of primers after the greedy
algorithm) primers in it. It also has p child-nodes (because there are p
ways in which one primer can be removed from a set of p primers), each
with p-1 primers. Each of them has p-1 children with p-2 primers and so
on. In this way, all possible combinations of primers in the set fulfilling the
requirements are found, and those combinations with the same, least
number of primers are saved as the final primer sets.
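
The depth-first redundancy check can be sketched as below. The sketch simplifies the minimum requirements to "every row is covered at least e times" and leaves out the separate count of positive signals per sequence; it also remembers visited subsets so that the same node is not expanded twice, which is an addition made for this illustration. The function name and the example matrix are hypothetical.

```python
def minimal_subsets(a, columns, e=1):
    """Depth-first search over tentative removals.

    a[i][j] is the binary matrix and `columns` the primers chosen by
    the greedy step. Returns every smallest subset of `columns` that
    still covers each row at least e times.
    """
    def feasible(cols):
        return all(sum(row[j] for j in cols) >= e for row in a)

    start = frozenset(columns)
    if not feasible(start):
        raise ValueError("starting set does not meet the requirement")

    best = {"size": len(start) + 1, "sets": set()}
    seen = set()

    def visit(cols):
        if cols in seen:
            return
        seen.add(cols)
        removable = False
        for j in cols:
            child = cols - {j}        # tentatively remove primer j
            if feasible(child):
                removable = True
                visit(child)
        if not removable:             # nothing more can be removed
            if len(cols) < best["size"]:
                best["size"], best["sets"] = len(cols), {cols}
            elif len(cols) == best["size"]:
                best["sets"].add(cols)

    visit(start)
    return sorted(sorted(s) for s in best["sets"])


# Hypothetical matrix: 3 rows (sequence pairs), 4 columns (primers).
a = [
    [1, 0, 0, 1],
    [1, 1, 0, 0],
    [0, 1, 0, 1],
]
result = minimal_subsets(a, [0, 1, 2, 3])
print(result)
```

The number of nodes grows quickly with p, so a search of this kind is only practical on the already reduced set produced by the greedy step.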

Instead of applying the greedy algorithm to the core set, a
modified algorithm called CFT may be applied.

Lagrangian subgradient

This algorithm consists of three main phases: a subgradient
phase where a near-optimal multiplier vector is found, a heuristic phase
where a solution to the SCP is found, and column fixing, designed to
improve the results of the heuristic phase.
In the subgradient phase, a near-optimal multiplier vector u is
found using Equation 8. At the beginning, the starting vector used is
defined as

$$u_i = \min_{j \in J_i} \frac{c_j}{|I_j|}$$

Equation 9

where J_i is the set of columns covering row i and I_j is the set of rows
covered by column j.

Later calls use the last vector u before column fixing, and
apply a small perturbation before using it as the starting vector. The
perturbation is randomly (and uniformly) distributed in the range ±10% for
each element. The sequence of multiplier vectors is considered to have
converged when the improvement in L(u) in the last 50 iterations is smaller
than 0.1%, or when the number of iterations reaches 10 × m. The factor λ
in Equation 8 is set to 0.1 at the beginning, and is updated as follows:
every 20 iterations, the best and worst lower bounds L(u) during those 20
iterations are compared to each other. If the difference is larger than 1%,
the value of λ is halved. If the difference is less than 0.1%, λ is multiplied
by 1.5. In the first call, the upper bound, UB, used is the sum of the costs
of the first primers that together cover all rows four times. Otherwise it is
the value of the best solution found so far.

In the heuristic phase, the last vector from the subgradient
phase is used to generate a sequence of multiplier vectors (again using
Equation 8), and a feasible solution is constructed for each of the multiplier
vectors. The procedure used to generate a feasible solution is a variation
of the greedy algorithm, where each column is scored according to

Equation 10

where R is the set of uncovered rows in each step. The
column with the lowest σ_j, i.e. the column with the best "gain/cost" ratio, is
added to the solution in each step. This continues until no improvements to
the best solution (i.e. minimum number of primers) have been made for 50
iterations.

After the heuristic phase, column fixing is applied to the
solution. Columns that are absolutely necessary in order for a row to be
covered (i.e. if there are only e columns covering a row and each row is to
be covered e times) are fixed. These fixed columns are then used as a
starting point for the greedy algorithm, and the first max{⌊200/m⌋, 1}
columns chosen therein are fixed as well.

These three phases are then applied again to the problem,
with the condition that the fixed columns must be included in the solution
this time. Columns already fixed in a previous round cannot be removed
from the solution. This goes on until either all rows are covered by the
fixed columns, or the cost of the fixed columns is larger than the estimated
lower bound for the entire problem, or no new columns were fixed in the
last iteration.

When the three phases are done, the problem is refined in
order to improve the solution. Here, each column in the best solution found
so far is scored according to

Equation 11

where

Equation 12

and S is the set of columns in the solution. The term ui(Ki - 1)
is the contribution of row i to the gap between the estimated lower and
upper bound of the problem. This is then split uniformly between all
columns in the solution covering that row. Columns with small Sj
(contributing the least to the gap) are then likely to be part of the optimal
solution. The p columns with the smallest Sj are then fixed before the entire
algorithm is applied again to the resulting sub-problem. (Column fixing
here has nothing to do with column fixing after the heuristic phase, so
columns fixed there need no longer be fixed here.) p is the smallest value
satisfying



Equation 13

where {jk} is the set of columns in the solution ordered with
ascending Sj, and Ij is the set of rows covered by column j. π is in the range
0...1 and controls the percentage of rows removed after fixing. π = 1
means that no rows will be uncovered, while π = 0 means that no
columns will be fixed before reapplying the algorithm. (Since each row has
to be covered multiple times, in this case it is not actually the number of
rows but the number of elements covering the rows that is regulated by
π.) In the beginning, π is set to 0.3 and is multiplied by α = 1.1 if the best
solution so far was not improved in the last application of the three main
phases. If a better solution was found, π was reset to 0.3. Because of the
density of the matrices, the number of columns fixed in this step was also
set to be at least one more than in the previous iteration (if no
improvements were made). Otherwise the same number of columns would
be fixed in a number of iterations before the value of π is large enough to
allow more columns to be fixed.
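The update rule for the refinement parameter can be sketched in Python as follows. This is only an illustration of the schedule described in the text: the function names `next_pi` and `columns_to_fix` are hypothetical, the value computed from Equation 13 is taken as an input because that equation is not reproduced here, and capping the parameter at 1 is an assumption based on its stated range.

```python
def next_pi(pi, improved, pi0=0.3, alpha=1.1):
    """Value of the refinement parameter for the next round of the three phases."""
    if improved:
        return pi0                    # a better solution was found: reset
    return min(1.0, pi * alpha)       # no improvement: allow more columns to be fixed


def columns_to_fix(p_from_eq13, p_previous, improved):
    """Number of columns to fix, given the value p obtained from Equation 13."""
    if improved:
        return p_from_eq13
    # Without improvement, fix at least one column more than last time.
    return max(p_from_eq13, p_previous + 1)
```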

The algorithm is iterated until either the value of the best
solution is less than the estimated lower bound, all columns in the best
solution found so far are already fixed in the refining step, or a time limit is
exceeded. The time limit in this case was arbitrarily set to as many
seconds as there were rows in the problem. However, the time limit is only
checked before the refining step. If it is not exceeded, a whole iteration of
the algorithm will be executed before another check is done. Here too a
check was done afterwards to see if primers could be removed without
breaking any constraints.

With this algorithm no pricing is performed. Pricing is used to
update the core problem, exchanging columns between the core problem
and columns outside the core. It was not included here since it was argued
that since the costs of the columns are all the same, the best columns
would be those with the largest number of non-zero elements. These
would be the first columns to be added to the core, and the columns not
included in the core would most probably not be better than those included.
Also, the pricing step would require some computation, which would extend
the time required by this algorithm. As is, the computational requirement of
this algorithm is several orders of magnitude higher than for the greedy
algorithm. Finally, the main memory available in the computer puts a limit
on how large the problems can be. If pricing were included, all the data
would not fit into the physical memory, forcing the computer to use a swap
file, which would increase the computation times considerably.

Using both alternative algorithms described above a minimum 
number of primers were identified for various sequences. The results are 
set out below. 

It will be apparent that the initial manual rejection of primers,
steps (12, 14 and 22), need not be performed and instead the algorithms
can be applied to the original complete set of primers. However, the initial
rejection of obviously failed primer candidates can significantly reduce the
computational time required in the later stages. Similarly, the final
redundancy check (30) need not be performed, as in many cases
little or no reduction in the number of primers was achieved by this final
check.

Furthermore, although in the method described above the
primers were initially sorted in order of score, this need not be performed.
The algorithms for stripping out redundant primers are capable of operating
with any order of primers, including a wholly random order. However,
slightly better results were obtained when ordering by score was
performed.



Collecting sequences 



The HLA-sequences were available internally from
Amersham Pharmacia Biotech (release December 1997), and included 91
alleles from HLA-A, 202 HLA-B, 47 HLA-C, 11 HLA-DPA1 (coding for the
α-chain), 74 HLA-DPB1 (β-chain), 18 HLA-DQA1, 34 HLA-DQB1, 192
HLA-DRB1 and 35 sequences in all of HLA-DRB3, -DRB4 and -DRB5. The
lengths of these sequences range from ~250 bp to ~1100 bp.

The 16S rRNA-sequences were collected from GenBank
(Benson et al., 1998), an annotated database of all publicly available DNA
sequences. Only a subset of all the available 16S rRNA-sequences was
used. The sequences used were all from organisms that could be
identified using either the MicroLog or the MicroStation system from Biolog
Inc., or the API systems from Counterpart Diagnostics. These systems
utilise differences in metabolism in order to identify the organisms, which is
the most common way of identifying micro-organisms today. Altogether,
1207 sequences from 523 different organisms were collected from
GenBank. 269 of those 523 organisms had only one 16S rRNA sequence
among those 1207 sequences. The length of these sequences is between
~1000 bp and ~1500 bp.



Data set     No. sequences    Mean length of sequences
DPA1         11               517
DPB1         74               288
DQA1         17               616
DQB1         34               490
DRB1         192              324
DRB345       35               400
HLA-A        91               944
HLA-B        200              900
HLA-C        47               1003
16S rRNA     1207             1452

Table 1: Details about data sets.

The program was written using the Microsoft® Visual C++®,
version 5.0 compiler. It was executed on a PC with a Pentium® MMX 233
MHz processor, 64 MB RAM and Windows® 95, unless otherwise
indicated. All execution times are for the entire program, including I/O.

As can be seen in Table 2, the binary SCP matrices were
quite dense. In SCP instances in general, the density (i.e. the proportion
of non-zero elements in the matrix) usually lies around a few percent, of
course depending on the application. A higher density means that fewer
columns are needed in order to cover all rows. This is offset in this case by
the fact that all rows were required to be covered multiple times. Another
consequence of this high density is that the number of primers needed
according to the greedy algorithm could be much higher than in the optimal
solution. (Recall that the worst case behaviour of the greedy algorithm is a
function of the largest column-sum of elements.)



Dataset      No. rows    Density (%)
DPA1         55          47.89
DPB1         2701        20.73
DQA1         136         36.31
DQB1         561         42.18
DRB1         18336       24.98
DRB345       595         37.70
HLA-A        4095        36.31
HLA-B        19900       32.33
HLA-C        1081        30.41
16S rRNA     727821      2.04

Table 2: Some details about the binary SCP matrix. Data are
calculated for all primers in the primary set.

The program could be considered as consisting of two
phases. The first phase involves constructing all primers and finding out
what kind of signal they will give for each sequence. The second phase is
the optimisation phase, where the SCP is solved. Some details about the
first phase can be found in Table 3.



Dataset      First set    Primary set    Core set    Time (s)
DPA1         1747         1333           106         4.67
DPB1         1885         1475           321         6.81
DQA1         2487         2166           213         11.26
DQB1         2891         2730           244         18.51
DRB1         3891         3651           385         42.29
DRB345       3031         3016           203         14.56
HLA-A        4756         3886           595         124.74
HLA-B        4994         4585           750         286.82
HLA-C        4293         3354           338         61.29
16S rRNA     247877       247877         2377        150632

Table 3: Number of primers in different stages of the algorithm and time to
get signals for all primers. The numbers of primers in the core are for
homozygotes.

One explanation for this high density is that the sequences in
the data sets are quite similar to each other, so that most primers will
hybridise to and give a signal for more than one sequence (either the same
or different signals). This is also indicated in Table 3, where for some data
sets there is a noticeable drop from the number of primers in the first set to
the number of primers in the primary set. Most of this reduction is due to a
primer having the same signal for all sequences, which in turn means that
all sequences have a substring that is similar enough for the primer to
hybridise to and that the nucleotide after the primer is the same for all
sequences. In contrast, the 16S rRNA data set has a much lower density,
and no reduction in the number of primers going from the first set of
primers to the primary set. As the sequences in this data set come from
organisms which might be only distantly related to each other, there need
not be as much similarity between the sequences as there is in the HLA
data sets. Another explanation is this: if all k sequences except one give
the same signal for a primer, that column in the binary SCP-matrix will have
k-1 non-zero elements. The density (for that column) will then be
(k-1) / (k(k-1)/2) = 2/k. In other words, the density will be higher for
smaller values of k, and smaller for larger values. This means that it would
be "natural" for smaller matrices to have higher densities, and larger
matrices to have lower densities.
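The 2/k argument above can be checked numerically with a short Python sketch. It assumes, as the text implies, that each row of the binary SCP-matrix corresponds to one pair of sequences and that an entry is non-zero when the primer gives different signals for the two sequences of that pair. The function name `column_density` and the example signals are illustrative assumptions.

```python
from itertools import combinations


def column_density(signals):
    """Fraction of sequence pairs that one primer tells apart.

    signals: the signal this primer gives for each of the k sequences.
    Each pair of sequences is one row of the binary matrix; the entry is
    non-zero when the two signals differ.
    """
    pairs = list(combinations(range(len(signals)), 2))
    nonzero = sum(1 for a, b in pairs if signals[a] != signals[b])
    return nonzero / len(pairs)


k = 11                                         # size of the DPA1 data set
n_rows = k * (k - 1) // 2                      # number of rows in the matrix
density = column_density(["A"] * (k - 1) + ["G"])   # one sequence differs
```

For k = 11 this gives 55 rows, in agreement with Table 2, and a column density of 2/11, or roughly 18%.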

In the second phase, solving the SCP, a few different
approaches were tried. The results, the minimum number of primers
needed and the time required to find this number, can be found in Table 4
and Table 5. Even though the worst case behaviour of the greedy
algorithm is not so good in this application, the results are not much worse
than when using a Lagrangian subgradient (CFT) method. The greedy
algorithm typically needs two or three more primers, while the computation
times are much lower for the greedy algorithm.

The results show that it is worthwhile to check the results
from the greedy algorithm for redundancy. In all cases except one, primers
could be removed with the resulting primer sets still fulfilling all
requirements. This is not true for the CFT algorithm, however, as there is
only one instance in which the result could be improved. On the other hand,
since there is some randomness in the CFT algorithm (an old multiplier
vector is disturbed randomly before being used as a starting vector in the
next iteration), the results can differ from one execution of the algorithm to
another. Sometimes the results can be improved, and sometimes not
(results not shown).
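The idea of a redundancy check can be illustrated with a minimal Python sketch. It makes a single pass over the selected primers and drops any primer whose removal still leaves every row covered the required number of times. This one-pass version is an assumption made for brevity and is simpler than the search used in the program; the names `covers` and `drop_redundant` are hypothetical.

```python
def covers(chosen, columns, n_rows, need):
    """True if the chosen columns cover every row at least `need` times."""
    count = [0] * n_rows
    for j in chosen:
        for r in columns[j]:
            count[r] += 1
    return all(c >= need for c in count)


def drop_redundant(chosen, columns, n_rows, need):
    """Remove columns one at a time as long as the cover stays feasible."""
    kept = list(chosen)
    for j in list(chosen):
        trial = [c for c in kept if c != j]
        if covers(trial, columns, n_rows, need):
            kept = trial              # column j was not needed
    return kept


# Toy instance: all four columns selected, each row must be covered twice.
cols = [{0, 1}, {1, 2}, {0, 2}, {0, 1, 2}]
reduced = drop_redundant([0, 1, 2, 3], cols, n_rows=3, need=2)
```

The result depends on the order in which primers are tried, which is one reason a fuller search can find smaller sets than a single pass.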



Dataset      Greedy    Time (s)     Final    Total (s)
DPA1         11        0.27         11       0.27
DPB1         42        1.37         41       1.81
DQA1         32        0.61         30       0.72
DQB1         31        0.71         29       0.88
DRB1         48        11.5         44       30.3
DRB345       24        0.66         21       0.71
HLA-A        73        4.61         72       6.48
HLA-B        103       31.36        99       85.14
HLA-C        51        1.15         47       1.76
16S rRNA     210       9921.48*     197†     >300000†

Table 4: No. of primers after the greedy algorithm and time
spent by it. Also final no. of primers after check for redundancy and the total
time spent solving the SCP. *Value from a 300 MHz Pentium II with 512 MB
RAM running Windows NT 4.0. †The computation was halted before
completion due to time constraints.



Dataset      CFT    Time (s)    Final    Total (s)
DPA1         10     10.22       10       10.22
DPB1         38     2748.92     38       2749.14
DQA1         26     60.80       26       60.86
DQB1         27     372.56      27       372.61
DRB345       20     427.32      20       427.38
HLA-A        69     4547.33     69       4548.49
HLA-C        47     1091.37     45       1111.70

Table 5: Results using modified algorithm CFT.

One reason CFT is not much better than the greedy algorithm
could be that it was designed for other instances of SCP. The SCP arising
in this application differs from those in three respects: A) the density is
much higher, B) all rows are to be covered multiple times and C) the costs
of all columns are the same.

A comparison was made between the results from the greedy
algorithm and from CFT in Table 6. Most of the primers (70% or more)
were chosen by both algorithms, indicating that these primers are likely to
be part of an optimal solution. However, this is only an indication, as the
only way to prove this is to find an optimal solution. This will require far too
much time even for the smallest data set, as the problem is NP-hard.



Dataset      Greedy    CFT    Same    Percent (%)
DPA1         11        10     7       70.00
DPB1         41        38     33      86.84
DQA1         30        26     22      84.62
DQB1         29        27     22      81.48
DRB345       21        20     14      70.00
HLA-A        72        69     62      89.86
HLA-C        47        48     38      80.85

Table 6: Comparison of primers from the two different
algorithms.

Results from combining HLA sequences in order to
differentiate between heterozygous individuals can be found in Table 7.
CFT was only used for the two smallest data sets due to the time
requirements. It performed slightly better than the greedy algorithm on those,
but only by one primer on each data set. There are heterozygotes that
cannot be distinguished from another heterozygote, which can be seen in
Table 7. This happens because the combination of two sequences to form
one heterozygote could result in exactly the same signal pattern as another
combination of homozygotes. In other words, some rows in the signal
matrix will be the same, leading to some rows in the binary SCP-matrix not
containing any non-zero elements at all. For some of those pairs listed,
this is not true, however. They are listed because there were not enough
primers that have different signals for these pairs, and so they could not
meet the requirement of at least four different signals in the signal patterns
(Table 8). For the rest, it is simply a limitation of this technique for typing
HLA-genes. To be able to identify the alleles forming each heterozygote,
primers that amplify alleles selectively should be used in the PCR step.
This will remove the ambiguities, as some heterozygotes will simply be
transformed to homozygotes since only one of the alleles in the
heterozygote will be amplified and not the other.

Dataset      Greedy    Time (s)      CFT    Time (s)    Amb. het.    Percent (%)
DPA1         26        0.99          25     1943.82     0            0.00
DPB1         130       9229.57                          16           0.58
DQA1         51        7.41          50     8427.82     2            1.31
DQB1         81        294.51                           2            0.34
DRB345       94        453.19                           6            0.95
HLA-A        172       20826.20*                        19           0.45
HLA-C        94        1212.59                          4            0.35

Table 7: Results from heterozygous pairs. Number of primers
needed, the time spent, how many heterozygotes did not differ by at
least four signals from any other heterozygote and the percentage of the total
number of heterozygotes. *Value from a 300 MHz Pentium II with 512 MB
RAM running Windows NT 4.0.
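How two different allele combinations can end up with the same signal pattern can be illustrated with a small Python sketch. It assumes a simple model in which the signal a primer gives for a heterozygote is the union of the signals it gives for the two alleles; this model, the function names `het_pattern` and `ambiguous_pairs`, and the toy alleles a1 to a4 are assumptions for illustration and are not taken from the program.

```python
from itertools import combinations, combinations_with_replacement


def het_pattern(sig_a, sig_b):
    """Combine two alleles' per-primer signals into one genotype pattern."""
    return tuple(frozenset(a) | frozenset(b) for a, b in zip(sig_a, sig_b))


def ambiguous_pairs(alleles, min_diff=4):
    """List genotype pairs whose patterns differ at fewer than min_diff primers."""
    genotypes = {
        (x, y): het_pattern(alleles[x], alleles[y])
        for x, y in combinations_with_replacement(sorted(alleles), 2)
    }
    found = []
    for (g1, p1), (g2, p2) in combinations(genotypes.items(), 2):
        diff = sum(1 for s, t in zip(p1, p2) if s != t)
        if diff < min_diff:
            found.append((g1, g2, diff))
    return found


# Toy data: two primers, one signal per primer and allele.
toy_alleles = {
    "a1": ["A", "A"],
    "a2": ["G", "G"],
    "a3": ["A", "G"],
    "a4": ["G", "A"],
}
identical = ambiguous_pairs(toy_alleles, min_diff=1)
```

In the toy data the genotypes a1/a2 and a3/a4 both give the signals A and G at both primers, so no choice of additional scoring can separate them with these two primers alone.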

Unfortunately, it was not possible to obtain any results for
heterozygotes for the data sets DRB1 and HLA-B, as these were too large
to run on existing machines. A very approximate extrapolation of the
primers needed for these data sets suggests that the total number of
primers for all HLA sets together would be <1000, which can be placed on
one chip without problem (one chip can contain up to ~5000 primers). Without
the reduction obtained above, at most two genes could be tested on each
chip. With the reduction, all nine HLA genes and the 16S rRNA gene can
be tested on one chip, with plenty of room to spare for other genes as
well. This makes APEX more versatile, as it allows a family of related
genes to be tested using only one chip instead of several.

Gene      Pair 1                     Pair 2                     No. diff.
DPB1      DPB1*0501 / DPB1*2101      DPB1*2201 / DPB1*3601      2
DPB1      DPB1*0501 / DPB1*5501      DPB1*3001 / DPB1*6301      2
DPB1      DPB1*0601 / DPB1*3601      DPB1*20011 / DPB1*2101     1
DPB1      DPB1*0801 / DPB1*1401      DPB1*1001 / DPB1*5701      0
DPB1      DPB1*0901 / DPB1*3001      DPB1*1701 / DPB1*5401      0
DPB1      DPB1*0901 / DPB1*3601      DPB1*2101 / DPB1*3501      0
DPB1      DPB1*0901 / DPB1*4501      DPB1*1001 / DPB1*1401      0
DPB1      DPB1*3901 / DPB1*5301      DPB1*4001 / DPB1*1901      0
DQA1      DQA1*0101 / DQA1*0104      DQA1*0101 / DQA1*0105
DQB1      DQB1*0604 / DQB1*0612      DQB1*0608 / DQB1*0609
DRB345    DRB4*01011 / DRB4*01011    DRB4*01011 / DRB4*0301N
DRB345    DRB4*01011 / DRB4*0103     DRB4*0103 / DRB4*0301N     0
DRB345    DRB4*0201N / DRB4*0201N    DRB4*0201N / DRB4*0301N    0
HLA-C     Cw*1203 / Cw*1602          Cw*12042 / Cw*1601         0
HLA-C     Cw*12042 / Cw*1502         Cw*1205 / Cw*1503          0
HLA-A     A*0101 / A*2411N           A*0104N / A*2402           0
HLA-A     A*0201 / A*0205            A*0202 / A*0206            1
HLA-A     A*0201 / A*0205            A*0214 / A*0222            1
HLA-A     A*0201 / A*0208            A*0205 / A*0220            0
HLA-A     A*0201 / A*0213            A*0212 / A*0226            2
HLA-A     A*0201 / A*2406            A*0222 / A*2413            0
HLA-A     A*0202 / A*0206            A*0214 / A*0222            0
HLA-A     A*0212 / A*2601            A*0222 / A*2608            2
HLA-A     A*2402 / A*2502            A*2407 / A*2501            0
HLA-A     A*2402 / A*68012           A*2407 / A*68031           0
HLA-A     A*2501 / A*68012           A*2502 / A*68031           0

Table 8: Heterozygous pairs that do not differ enough in their signal
patterns, and how many signals they differ by.

The results of this work are summarised in the following
Table 9.

Class I    Number of    Primers    Class II    Number of    Primers
           alleles      needed                 alleles      needed
HLA-A      91           172        DPA1        11
HLA-B      200          <1000      DPB1        74           130
HLA-C      47           94         DQA1        17           51
                                   DQB1        34           84
                                   DRB1        192          <1000
                                   DRB345      35           94

Table 9. Number of primers needed to discriminate between
heterozygote HLA samples.

Some sets of primers indicated in Table 9, and also the set
indicated for 16S rRNA, are set out in appendix 2.

Primers can be arranged on the surface of a support in such
a way that different studied types, genes, alleles, species etc. form easily
recognised characters such as figures or letters. These character-forming
primers can be additional primers of common origin from the gene of
interest and be used for validation of the process.

The following demonstration is based on the HLA Class II
DQB gene.

Experimental

Materials

Amplification:

DNA: Four cell lines homozygous for DQB, with alleles 0402, 0301, 06011
and 0201.

Primers: Primer DQB 9246 from Williams et al. -96 and DQB 96012 from
the Amersham Pharmacia Biotech HLA DQB typing kit, covering exon 2,
generating a fragment of 300 base pairs.

Amplification reagents: PCR mix from the Amersham Pharmacia Biotech
HLA DQB typing kit, a prototype kit.

All amplifications were spiked with dUTP to get a final concentration of 100
or 200 mM dUTP.

Enzymes for fragmentation of PCR products:
Shrimp alkaline phosphatase (SAP), 1 U/µl, APB.
Uracil-DNA-glycosylase (UDG; called UNG if from PE), 1 U/µl, NE Biolabs.

SAP will degrade (dephosphorylate) all free dNTPs, and UDG
will remove all dU from the DNA; after heating, the strands will be
broken at these points. This step is applicable to any DNA fragment.



Primers for spotting:

All 84 primers for the 500 bp fragment were ordered from
LTI/GIBCO BRL Custom primers service. All were 25-mers with an
amino-activated 5'-end. For primer sequences see appendix 1. Self-extended
primers were N, A, C, G and T as controls with the following sequences:

N: amino TTTAGCCTTAACGCCTNTGAGGTCA

A, C, G, T: amino TTT AGC CTT AAG GGG T X TGAC GTGA, where X is
A, C, G or T.

Extension reagents for the APEX reaction

Dyes: Specially synthesised for Baylor by Du Pont and/or APB

Cy2 - ddCTP (equal to fluorescein) 50 µM
Cy3 - ddATP 50 µM
Texas Red - ddGTP 50 µM
Cy5 - ddUTP (often written as T in many of the reactions and
results) 50 µM

10x ThermoSequenase™ DNA polymerase buffer (TS):
260 mM Tris-HCl pH 9.5; 65 mM MgCl2.

ThermoSequenase DNA polymerase (Amersham Pharmacia Biotech), 4 U/µl;
if needed, dilute with T.S. dilution buffer (10 mM Tris-HCl pH 8.0; 1 mM
β-mercaptoethanol; 0.5% Tween-20 (v/v); 0.5% Nonidet P-40 (v/v)). TS was
used from a 150 unit stock and diluted 1 µl + 37 µl dilution buffer.

Methods 

Preparation of glass slides before spotting of primer:

Arrange 25-30 cover slips (24 x 60 mm) in a stainless steel staining
tray.

Immerse the tray in a glass staining dish with enough acetone to fully
immerse the slides.

Place the glass staining dish in a sonicator for 10 minutes.

Remove the tray from the acetone bath, shake off excess
acetone and rinse several times (at least twice) in MilliQ water.

Immerse the tray in 100 mM NaOH and sonicate for 10 minutes
(a few more minutes is no problem).

Remove the tray, shake off excess NaOH and rinse
several times (at least twice) in MilliQ water.

Immerse the tray in silane solution and sonicate for 2 minutes.

Wash the slides by immersion in 100% EtOH once.

Dry the tray with the slides using a high-velocity stream of nitrogen
(without breaking the slides).

Cure the slides in a vacuum oven at 100°C overnight or until
they are used for spotting (at least 20 minutes of vacuum is needed).

Spotting of oligos:

All spotting was done with a spotter with 96 parallel capacity.
Each slide was spotted with three replicas of the primers.
After spotting, the slides were allowed to air dry for 5 to 15
minutes; when dry they were marked. They were stored at room
temperature, in a dry place, in the trays until used.

DQB amplification

The DQB amplification was done according to the method
described by Williams et al. -96 using a 33% dUTP mix. After 40 cycles
(95°C, 30 sec; 55°C, 30 sec; 72°C, 30 sec), one microliter of the PCR
products was tested on a 1.5% agarose gel before the fragmentation step.

Williams, Bassinger, Moehlenkamp, Wu, Montoya, Griffith,
McAuley, Goldman, Maurer: Strategy for distinguishing a new DQB1 allele
(DQB1*0611) from the closely related DQB1*0602 allele. Tissue Antigens,
1996, 48:143-147.

Fragmentation of PCR products:

Before APEX can be done, all DNA fragments must be
fragmented so that all new fragments can get access to the primers on the
chip.

Set up:

5 µl DNA from a PCR reaction (1/10 of the PCR reaction)
2 µl SAP (Shrimp alkaline phosphatase), 1 U/µl, APB
1 µl UDG (Uracil-DNA-glycosylase), 1 U/µl, NE Biolabs
15 µl water
Total: 23 µl

Incubate at 37°C for 2 hours.

The samples were frozen and stored until they were used.

Inactivation of the enzymes at 100°C for 10 minutes can be done,
but is not needed since this is the first step in the APEX reaction.



Extension method for the APEX reaction

Slide treatment:

Start by washing the slides in hot water (90 - 98°C, not
boiling) for 2 x 5 minutes in a 50 ml Falcon tube. When the slides are
ready, remove them from the tube with forceps and place them on a dry
heater block at 48°C. The slide (= DNA chip) is now ready for adding the
reactions.
APEX reactions set up:

23 µl DNA from the fragmentation step.
3 µl 10x TS reaction buffer (the rest of the buffer comes from the PCR and
UDG cleavage).
17 µl for cover slip method.
Heat denature at 100°C for 7 - 10 minutes (target 8 minutes, not longer).
Spin the tube quickly and quickly add
1 µl ThermoSequenase DNA polymerase (4 U)
1 µl dye mix (50 µM of the four dideoxynucleotides A, C, G and T,
separately dye labelled).

The reaction mix was then physically spread out over the
primer array with the tip of a pipette. Incubate at 48°C until no trace of
solution is seen. This takes about 8 minutes.

Wash with hot water for 2 - 5 minutes, 2 times. The slide is then ready to
be read on the detection instrument.

Detection

The detection system is a total internal reflection fluorescence
(TIRF) system, where microscope slides are placed on top of a prism, with
oil used to couple a laser beam into the glass slide. The system can switch
between light of five different wavelengths from five different lasers. In this
experiment only four were used. To detect Cy2 a 488 nm laser was
used, for Cy3 a 532 nm laser, for Cy5 a 635 nm laser and for Texas Red a
670 nm laser. The image-related software was based on Image Pro Plus
3.0.

Results

Amplification of HLA DQB alleles

The DNA from the four DQB homozygote cell lines was
amplified according to the protocol in Williams et al. -96 with two different
concentrations of dUTP. In addition to this, DNA from six different
heterozygotes was amplified. All amplifications worked well and the
expected 300 bp fragment was seen from all samples.

APEX reaction with DQB chip

Primer chips were washed and fragmented PCR products
were incubated on the chip according to the protocol. The image was
compared to the expected pattern. The recorded pattern was similar to, but
somewhat different from, the expected pattern. The reason for this is that
the set-up was planned for a 500 bp fragment, but the actual fragment used
was a 300 bp PCR fragment.

Homozygous cell lines results

Figure 4 shows the results from a cell line homozygous for
the DQB 0204 allele. The pattern shown in the image is very close or
similar to the expected results from exon 2.

In all reactions the control primers worked well and the four
dyes were used in the same frequencies. In the case with a 500 bp
fragment for DQB typing, the primers for allele 0402 were placed in such a
way that they formed figures. In Figure 4, panel D, most signals are seen
forming a "2" from the 300 bp fragment, and the missing signal will be seen
when the large PCR fragment is used. This clearly shows that primers can
be placed in a clever way to form figures.

Heterozygous results

For the heterozygous test only one of the four dye reactions
worked. Some of the expected spots from the heterozygous sample were
not seen, but this is probably due to the fact that no control signals were
seen in the lower right-hand corner, where the signals were weaker than in
other parts of the slide.

As this experiment shows, a limited number of primers can be
used for HLA typing, and if they are placed in a clever way the interpretation
of the results is very simple. Both homozygous and heterozygous samples
can be correctly analysed with this method.






Continuation 

An algorithm was developed in order to select the minimum
number of primers needed to identify different genes using APEX. It was
applied to the following HLA genes: HLA-A, HLA-B, HLA-C, HLA-DPA1,
HLA-DPB1, HLA-DQA1, HLA-DQB1, HLA-DRB1 and HLA-DRB345. It was
also applied to the 16S rRNA gene. In the case of HLA-DQB1, the primers
have been shown to work as intended. As is, a few assumptions were
made (such as how many mismatches to allow between the primers
and the sample DNA) that need to be tested and possibly refined.

Another improvement that can be made is the following: As is,
the program works only with discrete signals, e.g. either there is a signal 'A'
or there is not, either there is a signal 'G' or there is not, and so on. A more
precise approach would be to predict how strong the signals will be for
each primer on each sequence. A rough estimate of the signal strength
should be possible given some thermodynamic data about the primers,
most notably their melting points. With this information, and knowing the
concentration of DNA in the sample among other things, it should be
possible to estimate the proportion of primers on the chip that will actually
react with the sample DNA. It would thus allow a rough estimation of what
strength the different signals will have. It will not be very precise, and the
estimate might possibly be off by a factor of 2 or more, but it will still give
some information about what signals to expect from the chip.

Given the melting points of the primers, the temperature at
which the reaction on the chip is carried out could be optimised as well.
Since the sequences are known, it is possible to estimate the melting point
of any primer to any sequence when there are a few mismatches. This
could be done for all primers on all sequences, and a range of
temperatures calculated. The actual temperature to use could then be
chosen so as to be as close to optimal for as many primers on as many
sequences as possible, instead of using a standard temperature as now.
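A very rough sketch of such a temperature selection is given below in Python. It uses the common GC-content approximation Tm = 64.9 + 41 (G + C - 16.4) / N for oligonucleotides longer than about 13 bases, which ignores salt concentration and mismatches, so it is only a stand-in for a proper thermodynamic estimate. The selection criterion (the candidate temperature with the most melting points within a fixed window) and the names `basic_tm` and `best_temperature` are assumptions made for this example.

```python
def basic_tm(primer):
    """Approximate melting point (deg C) from GC content; no salt or mismatch terms."""
    n = len(primer)
    gc = sum(1 for base in primer.upper() if base in "GC")
    return 64.9 + 41.0 * (gc - 16.4) / n


def best_temperature(tms, window=5.0):
    """Pick the candidate temperature with the most Tm values within `window`."""
    candidates = sorted(set(round(t) for t in tms))
    return max(candidates, key=lambda c: sum(1 for t in tms if abs(t - c) <= window))


# A hypothetical 25-mer with 12 G/C bases.
example_tm = basic_tm("ATGCATGCATGCATGCATGCATGCA")
# Hypothetical melting points for four primer/sequence combinations.
chosen = best_temperature([50.0, 52.0, 54.0, 70.0], window=2.0)
```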

Another possibility would be to try other heuristics to solve the
resulting SCP. Even though CFT does give better results than the greedy
algorithm, it is not by much. It could be that Lagrangian relaxation methods
really are not suitable for unicost problems, but the only way to find out is to
try heuristics based on other ideas. It might be possible to reduce the
binary SCP-matrix as well, before applying any heuristic to it. Some rows
in the matrix could end up the same, in which case one of them could be
removed in order to reduce the number of rows and thus speed up
computation. No figures exist for how many rows might be the same, but it
could be worthwhile examining this possibility to reduce problem size.
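Removing identical rows before solving could look like the following Python sketch. It keeps the first occurrence of every distinct row and records which reduced row each original row maps to, so that a cover of the reduced matrix is also a cover of the original one. The function name `merge_duplicate_rows` is hypothetical.

```python
def merge_duplicate_rows(rows):
    """Collapse identical rows of a binary matrix.

    rows: list of sequences of 0/1 values, one per row.
    Returns (unique_rows, mapping) where mapping[i] is the index in
    unique_rows of the row that original row i was merged into.
    """
    index, unique_rows, mapping = {}, [], []
    for row in rows:
        key = tuple(row)
        if key not in index:
            index[key] = len(unique_rows)
            unique_rows.append(key)
        mapping.append(index[key])
    return unique_rows, mapping


unique_rows, mapping = merge_duplicate_rows([(1, 0, 1), (0, 1, 1), (1, 0, 1)])
```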

The algorithm itself could be improved. The complexity of the
redundancy-check phase can be slightly reduced by keeping a vector
consisting of the sums of the rows in each node. For each child node, the
column to be removed is then subtracted from this vector of sums. This
operation can be carried out in O(m), and the final complexity will then be
O(m x N(p, p)) instead. For the greedy algorithm, another possible
improvement is to check the primer set for redundancy each time a primer
is added. The complexity of the greedy algorithm will be the same, as
the check will take O(m x p) (i.e. the same as each iteration in the greedy
algorithm) each time (with the improvement just mentioned). The check
could take longer, but that is unlikely, as that would imply that one primer
could make several other primers redundant. The main advantage is, of
course, that no redundancy check with its rather high complexity is needed
afterwards.
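The row-sum idea can be sketched in Python as follows. A node keeps, for every row, the number of selected columns covering it; testing whether a column is removable and deriving the row sums of the child node both touch at most m entries. The names `removable` and `remove_column` and the parameter `need` (the required number of covers per row) are assumptions for the example.

```python
def removable(row_sums, column, need):
    """True if dropping `column` leaves every row it covers with >= need covers."""
    return all(row_sums[r] - 1 >= need for r in column)


def remove_column(row_sums, column):
    """Row sums of the child node: the parent's sums minus the removed column."""
    child = list(row_sums)
    for r in column:
        child[r] -= 1
    return child


parent = [3, 3, 3]                        # every row currently covered three times
child = remove_column(parent, {0, 1})     # drop a column covering rows 0 and 1
```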

The most serious problem is the sheer size of the problems. 
For the 16S rRNA data set, around 300 MB is required just to store 
all the primers and their signals. Add to that the fact that all primers need 
to be traversed once for every iteration of the greedy algorithm, and the 
result is that it will take quite some time as well. This also means that it is 
not even feasible to use more elaborate algorithms such as the CFT algorithm 
on the 16S rRNA data set, unless a much more powerful computer is 
available. On the other hand, algorithm CFT would probably benefit quite a 
lot from a parallel computer, since much of the computation could be carried 
out as vector operations. It should then be possible to spread the 
computations over several processors, thus reducing the time required. It 
would also reduce the memory requirements on each processor (but then 
parallel computers tend to have enough memory to store all necessary data 
for this problem on each processor anyway). Even the greedy algorithm 
would benefit from a parallel computer, as each processor can be charged 
with the task of scoring only a subset of primers. It is not as critical in this 
case, though, since the computation times are not very high when using the 
greedy algorithm. 
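
The idea of letting each processor score only a subset of the primers can be sketched as below. This is a minimal sketch under stated assumptions: the matrix layout, the function names and the use of Python's `concurrent.futures` thread pool are choices made for this example only. Threads in CPython do not give true parallelism for this kind of work; a process pool or a genuinely parallel machine would be needed for a real speed-up, but the partitioning of the work is the same.

```python
from concurrent.futures import ThreadPoolExecutor


def score_chunk(matrix, uncovered, chunk):
    """Score each primer (column) in chunk by the number of
    still-uncovered rows it covers."""
    return [(sum(matrix[i][j] for i in uncovered), j) for j in chunk]


def best_primer_parallel(matrix, uncovered, workers=2):
    """Split the primers among workers, score each subset separately,
    then combine the partial results to find the best primer."""
    p = len(matrix[0])
    # Worker k scores primers k, k + workers, k + 2 * workers, ...
    chunks = [range(k, p, workers) for k in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(lambda c: score_chunk(matrix, uncovered, c), chunks))
    scored = [pair for part in parts for pair in part]
    # Highest score wins; ties are broken by the lowest primer index.
    score, column = max(scored, key=lambda sc: (sc[0], -sc[1]))
    return column, score


# Hypothetical example with four rows and three primers.
example = [[0, 1, 0],
           [1, 1, 0],
           [1, 0, 1],
           [0, 0, 1]]
first_pick = best_primer_parallel(example, uncovered=[0, 1, 2, 3])
```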

As it stands, this method is only capable of identifying known gene 
variants. If applied to a sample with a previously unknown variant, it is very 
probable that the new variant will be falsely identified as one of the known 
variants. It would be very advantageous if the method could be 
augmented in some way to recognise this situation and give a warning if there 
could be an unknown variant in the sample. This could be done by giving a 
warning when the signal pattern obtained differs from the signal patterns of 
all known variants, but this might not be enough. There is no guarantee 
that the new variant does not differ only in some place not affecting any of 
the existing primers, which would make the new variant 
indistinguishable from one of the known variants. Some other way is 
probably needed as well. 
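
The simple warning described above could look something like the following sketch. The representation of a signal pattern as a sequence of 0/1 values, the dictionary of known patterns and the function name are hypothetical and chosen only for illustration.

```python
def identify_variant(observed, known_patterns):
    """Compare an observed signal pattern with the known variants.

    Returns (matching variant names, warning). The warning is None when
    at least one known variant has exactly the observed pattern.
    """
    observed = tuple(observed)
    matches = sorted(name for name, pattern in known_patterns.items()
                     if tuple(pattern) == observed)
    if not matches:
        return [], ("signal pattern matches no known variant; "
                    "the sample may contain an unknown variant")
    return matches, None


# Hypothetical signal patterns over three primers.
known = {"variant-1": [1, 0, 1], "variant-2": [0, 1, 1]}
hit = identify_variant([1, 0, 1], known)
miss = identify_variant([1, 1, 1], known)
```

As the text points out, an exact match is not proof that the variant is known: a new variant that differs only outside the regions probed by the primers would produce the same pattern and pass this check silently.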
APPENDIX 1 

Primer sequences for DQB heterozygote typing 

Primers 'dqb1-1' to 'dqb1-8' placed in positions A3-A10. 
Primers 'dqb1-9' to 'dqb1-18' placed in positions B2-B11. 
Primers 'dqb1-19' to 'dqb1-30' placed in positions C1-C12. 
Primers 'dqb1-31' to 'dqb1-42' placed in positions D1-D12. 
Primers 'dqb1-43' to 'dqb1-54' placed in positions E1-E12. 
Primers 'dqb1-55' to 'dqb1-66' placed in positions F1-F12. 
Primers 'dqb1-67' to 'dqb1-76' placed in positions G2-G11. 
Primers 'dqb1-77' to 'dqb1-84' placed in positions H3-H10. 
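
The placement listed above can be expressed as a small lookup, which may be convenient when relating a well position to a primer. This sketch assumes a standard 8 x 12 plate with rows A to H and columns 1 to 12, and reads the partly garbled entries in the list as 'dqb1-31' and 'G2-G11'; the table and function names are introduced here for illustration only.

```python
# (plate row, first column used, first primer number, last primer number)
LAYOUT = [
    ("A", 3, 1, 8),
    ("B", 2, 9, 18),
    ("C", 1, 19, 30),
    ("D", 1, 31, 42),
    ("E", 1, 43, 54),
    ("F", 1, 55, 66),
    ("G", 2, 67, 76),
    ("H", 3, 77, 84),
]


def well_for_primer(n):
    """Return the well (e.g. 'A3') holding primer dqb1-n."""
    for row, first_column, first, last in LAYOUT:
        if first <= n <= last:
            return f"{row}{first_column + (n - first)}"
    raise ValueError(f"no position listed for primer dqb1-{n}")


# All 84 primers mapped to their wells.
wells = {n: well_for_primer(n) for n in range(1, 85)}
```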

dqb1-1 NH2 - TCC ATC ACA GGA GTC AGA AAG GGC T 
dqb1-2 NH2 - GTG TGC AGA GAG AAC TAG GAG GTG G 
dqb1-3 NH2 - GCG GTG AGG GTG CTG GGG CTG CCT G 
dqb1-4 NH2 - TAA TGA GGG GGG TGG ACA C/^ GGC 0 
dqb1-5 NH2 - GCG GTG ACG CCG CTG GGG CCG CCT G 
dqb1-6 NH2 - GGA CAT CCT GGA GGA GGA CCG GGC G 
dqb1-7 NH2 - GTG GTG ACG CCG CTG GGG CCG CCT G 
dqb1-8 NH2 - TCC GTC AAA GGA GTC AGA AAG GGC T 
dqb1-9 NH2 - GAT GTA TCT GGT CAC ACC CCG CAC G 
dqb1-10 NH2 - CCG AGT ACT GGA ATA GCG AGA AGG A 
dqb1-11 NH2 - GAT GTG TCT GGT CAC ACC CCG CAC G 
dqb1-12 NH2 - GGG TGG ACA CAA CGC CGG CTG TCT C 
dqb1-13 NH2 - GGG TGG ACA ^GC CGG TTG TC^^^ 
dqb1-14 NH2 - CTT CTG GCT ATT CCA GTA CTG GGC G 
dqb1-15 NH2 - TTC CGG GCG GTG ACG CTG CTG GGG C 
dqb1-16 NH2 - GCT TCG ACA GCG ACG TGG GGG TGT A 
dqb1-17 NH2 - GCT GTT CCA GTA CTC GGC GCT AGG C 
dqb1-18 NH2 - CTT CTG GCT GTT CCA GTA CTC GGC G 
dqb1-19 NH2 - ACC GTG TCC AAC TCC GCG CGG GTC C 
dqb1-20 NH2 - CAC AAC GCC GGT TGT CTC CJC CTG G 
dqb1-21 NH2 - CTC CTC CTG GTC ATT CCG AAA CCA C 
dqb1-22 NH2 - CCA GGA TCT GGA AAG TCC AGT CAC C 
dqb1-23 NH2 - GAG CGC GTG CGT CTT GTA ACC AGA T 
dqb1-24 NH2 - GAC ATC CTG GAG AGG AAA CGG GCq ^ 
dqb1-25 NH2 - AGA GAC TCT CCC GAG GAT TTC GTG T 
dqb1-26 NH2 - TAG TTG TGT CTG CAC ACC CTG TCC A 
dqb1-27 NH2 - ACG TAC TCC TCT CGG TTA TAG ATG T 
dqb1-28 NH2 - GCT TCG ACA GCG ACG TGG AGG TGT A 
dqb1-29 NH2 - TCC GTC CCA TTG GTG AAG TAG CAC A 
dqb1-30 NH2 - TGA TAA GGC CCA GCC CGA GGA AGA T 
dqb1-31 NH2 - GGG TGG ACA CAA CGC GAG TTG TCT C 
dqb1-32 NH2 - GGG TGG ACA CAA CGC GAG CTG TCT C 
dqb1-33 NH2 - GAC AGG GAC GTG GAG GTG TAC CGG G 
dqb1-34 NH2 - TCC GTC CCG TTG GTG AAG TAG CAC A 
dqb1-35 NH2 - GCA CGA CCT TGC AGC GGC GAC CCC A 
dqb1-36 NH2 - GAA CAG CCA GAA GGA AGT CCT GGA G 
dqb1-37 NH2 - CTT CTG GCT GTT CCA GTA CTC GGC A 
dqb1-38 NH2 - AAC GCC AGC TGT CTC TTC CTG GTC A 
dqb1-39 NH2 - GAG AGG ACC CGG GCG GAG TTG GAC A 
dqb1-40 NH2 - GCA GGC GGC CCC AGC GGC GTC ACC A 
dqb1-41 NH2 - GTC GCT GTC GAA GCG CAC GTC CTC C 
dqb1-42 NH2 - 5tC TGT CCT GGA TGG GGT CGC CGC T 
dqb1-43 NH2 - ACG GGA CGG AGC GCG TGC GTT ATG T 
dqb1-44 NH2 - GAA GTA GCA CAT GCC CTT AAA CTG G 
dqb1-45 NH2 - TCG GTG GAC ACC GTA TGC AGA CAC A 
dqb1-46 NH2 - GGA M CGT GTA CCA GTT TAA GGG C 
dqb1-47 NH2 - ACG TAC TCT TCT CGG TTA TAG ATG T 
dqb1-48 NH2 - GAG AGG ACC CGA GCG GAG TTG GAG A 
dqb1-49 NH2 - ACC CCA GCC TCC AGA GCC CCA TCA C 
dqb1-50 NH2 - CAA CGG GAC GGA GCG CGT GCG GGG T 
dqb1-51 NH2 - ACA TCT ATA ACC GAG AGG AGT ACG C 
dqb1-52 NH2 - GAA CAG CCA GAA GGA CAT CCT GGA G 
dqb1-53 NH2 - CCT TCT GGC TAT TCC AGT ACT CGG C 
dqb1-54 NH2 - TTA AGG CCA TGT GCT ACT TCA CCA A 
dqb1-55 NH2 - TTC AGA TTG AGC CCG CCA CTC CAC G 
dqb1-56 NH2 - ATC TGG TCA CAA GAC GCA CGC GCT C 
dqb1-57 NH2 - AGT AGC ACA GGC CCT TAA ACT GGT A 
dqb1-58 NH2 - ATG TAT CTG GTC ACA CCC CGC ACG A 
dqb1-59 NH2 - ATC TGG TCA CAT AAC GCA CGC GCT C 
dqb1-60 NH2 - ATC AAA GTC CAG TGG M CGG AAT G 
dqb1-61 NH2 - ACG TGG GGG TGT ATC GGG TGG TGA C 
dqb1-62 NH2 - ATC AAA GTC CGG TGG M CGG AAT G 
dqb1-63 NH2 - GTA TCT GGT CAC ACC CCG CAC GAG C 
dqb1-64 NH2 - CGC TGT CGA AGC GCA CGT CCT CCT C 
dqb1-65 NH2 - GGA M CGT GTT CCA GTT TAA GGG C 
dqb1-66 NH2 - TGT GGG CTC CAC TCT CCT CTG CAA G 
dqb1-67 NH2 - ACG TCC TCC TCT CGG TTA TAG ATG T 
dqb1-68 NH2 - TTG CAG CGG CGA CCC CAT CCA GGA C 
dqb1-69 NH2 - GAA GTA GCA CAG GCC CTT AAA CTG G 
dqb1-70 NH2 - GAA GTA GCA CAT GGC CTT AAA CTG G 
dqb1-71 NH2 - TCG ACA GCG ACG TGG GGG TGT ACC G 
dqb1-72 NH2 - TCG ACA GCG ACG TGG GGG AGT TCC G 
dqb1-73 NH2 - TGT GGG CTC CAC TCG CCG CTG CAA G 
dqb1-74 NH2 - CGG CGT CAG GCC GCC CCT GCG GGG T 
dqb1-75 NH2 - TCG ACA GCG ACG TGG AGG TGT ACC G 
dqb1-76 NH2 - GCG TTG GAG GCT TCG TGC TGG GGC T 
dqb1-77 NH2 - CGG TGA CCC CGC AGG GGC GGC CTG A 
dqb1-78 NH2 - ATG GGA CGG AGC GCG TGC GTT ATG T 
dqb1-79 NH2 - CGG TGA CGC CGC TGG GGC GGC TTG A 
dqb1-80 NH2 - ACG GGA CGG AGC GCG TGC GTC TTG T 
dqb1-81 NH2 - TGA TAA GGC CAA GCC CAA GGA AGA T 
dqb1-82 NH2 - GAG ACT CTC CCG AGG ATT TCG TGT A 
dqb1-83 NH2 - CGT CGC TGT CGA AGC GCA CGT CCT C 
dqb1-84 NH2 - GAC TCT CCC GAG GAT TTC GTG TAC C 
Homozygotes 

(From CFT if available, otherwise greedy algorithm). 

DPA1 

TTTTTTTTTTTTGCCCAGGGCACAG 
TTTTTTTTTTTTAAGGAAAAGGCTC 
TTTTTTTTTTTTTGGATCTGGACAA 
TTTTTTTTTTTTCTGGCCCAGCTCC 
TTTTTTTTTTTTTTGTACAGACCCA 
TTTTTTTTTTTTAGGGGACCCTGTG 
TTTTTTTTTTTTGGCGGACCATGTG 
TTTTTTTTTTTTCTGCTCATCTTCA 
TTTTTTTTTTTTGTCAACTTATGCC 
TTTTTTTTTTTTTCAGGCCGCCAAT 
DPB1 

1 I I I M I I I I I I CAACCGGGAGGAG 
I I M I I I I I I i I GGCCTGAGGAGGA 
i I I I I I I 1 1 I I I CAACCTGGAGGAG 
5 I I I M 1 I II M t I CCAGTACTCCTC 
I I I I I I 1 I 1 I I i I GCCGTAACTGGT 
1 I M I I I I I I I M GGGGCGGCCTGA 
I I 1 I I i M I M I GCGCGTACTCCTG 
[ i I I I I i I I I I I 1 GGACAGGAGGAA 

10 I I 1 i I I M I M I CACAGGAGGAGCA 
I i I I I I i I i I I I I I GCTCCTCCTGT 
I M i M I i I I i 1 GGCAATGCCCGCT 
I I I I I I I I I I I I GGCAGTGCCCGCT 
M i I I 1 I I M i lA GAGAATTACGTG 

15 I I I i I I I I i I I I I CCAGAGAATTAC 
I M I I I M I I I I AACTACGAGCTGG 
1 M I I i I I I I M GGTCATGGGCCCG 
I I 1 M 1 I I I 1 1 1 I GACCCTGCAGCG 
M I I I I I I I I M l A CACGTAATTCT 

20 I H 1 H I I I I I I GTAACTGGTACAC 
i i I I I I i I I 1 I 1 CTGA CGAGG AGTA 
I M I I I I 1 I I I 1 1 I ACCTTTTCCAG 
I I I 1 I I I I I I I I CCTGGAAAAGGTA 
I M 1 I I I I I I I I GAGAATTACCTTT 

25 i i I I i I i I M i i GCGTGACGAGGAG 
I I M I I I M I I l A CTGGTGCACGTA 
1 I I I I I M 1 I 1 I I CCTCCAGGATGT 
I I 1 1 I I I M I I I CGGGAGGAGCTCG 
1 ( I I M 1 I I i 1 l AGCCAGAAGGACA 

30 I I IN I I I I I I I CAGCCAGAAGGAC 
I I I 1 I I I I I I I lA GTGCCGGACAGG 
I I I I I 1 [ i I 11 l ATTGCCGGACAGG 
I I U I I I I i I I I CCTGCAGCGCCGA 
I I I M M i I 1 I I AGAGAATTACCTT 

35 1 i I I I 1 I I M I i GGACTCGGCGCTG 
I I M I I I M I I l ACTACGAGCTGGG 
i I I I m i I 1 I I GCTTCGTGCTGGG 
I I i I I I i I i I i I GTCCCTGGTACAC 
I I I M i I I 1 I I I GCGCTGCAGGGTC 

40 DQA1 

1 i I j I I I i I I I l ACATCCTCATCTG 
I H M I I I i H lA CACCCTCATCTG 
M I I I I i I I I I I CAAGTTTACACCA 
I H I I I I I 1 I i i CAGCCACAATGTC 

45 H I I I 1 I I I I I I i CCAAGTCTCCCG 
I I i I j 1 I I 1 I i I CGGGAGACTTGGA 
1 I I I I 1 I 1 I I I lAATTCATGGCTGT 
I I i 1 I I 1 I I M iA CAATCCCAGGGC 
M I I I I H I I I l A CAACCCCAGGGC 

50 I N I I I i I I I 1 I GTGGGCATTGTGG 
i I H I I M I I I I CCAACACCCTCAT 
TTTTTTTTTTTTGGCCCACAGACAA 
M I i M I I I I i I GATGGGCATTGTG 
I I I I I I I I 1 I I I GGCCTGGATGAGC 

55 I I I I I I I I I i I l AGGGTGATCCAGG 
I i I I H i U I i i CAAGACCCTCATT 
I I M I 1 I j I I I I AGGACTGGGGACT 
I I I I { I 1 I M M AAGGGCCATTGTG 
T 
DQB1 



T 



I 



DRB1 



T 



TTTAAATTCATGGGTG 
TTTCACCATAAGAGGC 
TTTCACCACAAGAGGC 
TTTCACGGTAAGAGGC 



I I i rTCCTCGCTTCTG 



nTTAACTCTCCTCAG 



TTTTAAATCTCATCAG 
TTTCTCCTCCCTTCTG 



TTTATCTTGCAGAGGA 

TTTCCTCTCCAGGATG 

nTGGGTCACCGCCCG 

nTGGGAGTTCCGGGC 

TTTCGCTCGGGTCCTC 

TTTCCAGTACTCGGCG 

TTTCTGGGGCCGCCTG 

TTTATGTCTACACCTG 



TTTAAAGGGCTTCTGC 

TTTAGCATCACCAGGA 

TTTGCCAGGAGGAGAC 

TTTACGAGGAGGAGAC 

TTTGGTTTCGGAATGA 

TTTGGGTGTATGGGGT 

TTTGTCGGAAAGGGCT 

TTTTGGTTTCGGAATG 

TTTCGAGTACTCGGCA 

TTTAGCGCACGATCTC 

TTTGTCTCTTCCTGGT 

TTTCGTCAAGCCGCCC 



TTTGCGTCAAGCCGCC 
nrCAAGGTCGTGCGG 
TTTCGGTTATAGATGT 



TTTTGTAACCAGACAC 
TTTGTATGCAGACACA 
TTTCACACCCCGCACG 
TTTACACCCGGGACGC 



TGGAAGTGGTGCTG 



I 1 i i I GTGGTGGCGGT 

TTTGCAGAACGCGGTA 

TTTGGGGAGGTGGAGA 

TTTGCGGTTCCTGGAG 

TTTGAGGGAGAAGGAC 

TTTGACTGGGGTCTGG 



TTGCAGGAGTCGGG 
TTTGAAATAAGACTGA 



TTTTGGAGGACAGGCG 
TTAGGTGGTGGGGTG 



TTTTAGTGCAAGAAAG 
TTACGGTGTCCAGGT 



TTTGGAGAGGTTTAGA 
TTTGGAGTAGTCGGGA 



TTTGGAGTACTGTAGG 
TTTGTGTAAACGTCTG 
TTTGGGTGCAGCGGCG 
rGGAGGAGTTCCTG 



TTTTGGAAGACGAGGG 
TTTGAGGAGGTTGTGG 
1 1 1 1 t 1 1 1 1 


I 1 1 OMOMOOV^O wOv^Ow 




1 1 1 1 1 1 1 1 1 






i 1 1 1 1 1 1 1 1 


1 1 1 f^f^AATPPTPTTf^f^ 




1 1 1 1 M i 1 1 


"TTTri CP A p A A A A A nr5 




1 1 1 1 J 1 1 1 f 


1 1 i APf^ I 1 1 PTTf^f^Afi 
1 1 1 MwO 1 1 1 w 1 1 




1 1 1 1 1 1 1 1 1 


i 1 1 Pf^f^APTPPTPTTfi 




i 1 M 1 1 M 1 


Mil APf^r^nTf^AnTf^T 




1 1 1 1 1 1 1 1 1 


\ 1 1 PPAf^f^Af^f^Af^TTn 




1 1 1 i 1 I 1 1 1 


"TTTf^TA ATTriTpn Ann 


1 f\ 

iU 


1 i M 1 1 i 1 1 


III! Pr^TAf^PnPf^Pf^T 




1 1 [ 1 1 1 1 TT 
1 1 1 1 1 1 1 1 1 


1 1 1 AAf^ATHPATPTAT 
I J 1 MMOM 1 OV-/M 1 V_/ 1 1 




■ r 1 1 1 1 1 1 TT 
1 1 1 1 1 1 1 1 1 


TTTT A P r5TPTf5 A riT(^T 

till MV-/0 1 w 1 OMO 1 w 1 




1 1 1 1 1 1 1 1 1 


1 1 1 PPAf^TAPTPAf^PA 




M 1 i 1 I ! 1 1 


"niCGTAGCGCGCGTA 


15 


1 1 M i 1 1 1 1 


TTTATCTCTCGACAAC 




i 1 M 1 1 1 M 


1 1 I GAGCTCCTCCTC3C3 




TTTTT TIM 
1 1 1 1 1 1 1 i 1 


TTTAACCAGGAGGAGT 




1 1 1 1 1 1 M 1 
1 1 1 1 1 1 1 1 1 


TTTAGGGCCCGCCTGT 




1 11 M 1 M 1 


'111 GGAGAGCTTCACA 




TTTTTTT I 1 
1 1 1 1 I 1 1 1 1 


TTTGGAGAGATTCAGA 




TT I 1 I 1 I 1 1 
i 1 1 1 i 1 1 1 1 


"TTTTCACCGCCCGGTA 




TTTT"! MM 
1 1 1 1 1 1 i 1 1 


TTTAACTACCGGGTTG 




TTTTT 1 1 1 1 
1 1 1 1 1 1 1 1 1 


1 1 1 CCAGTACTGGGCA 




DRB345 




ZD 


M 1 M M M 


1 1 1 f^TATPTf^TPPAHG 




"f~rTTTTTT'l 
M M 1 1 1 1 1 


1 1 1 f^APTnf^f^(lTGGTG 




M 1 1 1 1 1 1 i 


1 1 1 PTr^TPf^AAGPGPA 




M 1 M 1 M 1 


1 1 1 \D 1 O 1 MMMww 1 1 W 




M M M M 1 


1 1 1 PTf^Tr^AAf^PTPTP 


irk 


M M M M 1 


1 1 1 pAppAf^nr^pppGC 

1 1 1 wMwVi/MO w www www 




M M M M 1 


1 1 1 f^r^PPAf^nTf^f^APA 

1 1 1 oO wwMw w 1 wwAAwrA 




M M M M 1 


Ml r^Pnf^TTPPTf^f^AG 

1 1 1 Owww 1 1 WW 1 




1 M M M 1 


"TTTTP A A P5 P P P GT 

i 1 1 1 L'OMMwwOwOww 1 




TTTTT "I I 1 
f 1 1 1 1 1 I 1 


1 1 1 TAAPPAHHAGGAG 

1 1 i i M/Aww/Aw\JA^VJ w/^vJ 


35 


1 1 1 1 1 1 1 1 1 


nTTACGTGGTCGGGTG 




1 1 1 1 1 1 1 1 1 


mTAGGGCCCGCCTGT 




M M M M 


1 1 1 GGGCCOoOO 1 o 1 0 




1 1 I 1 1 1 TT 
1 1 1 1 i 1 1 1 


TTTAACTACGGAGTTG 




1 1 1 i 1 i 1 1 
1 1 1 1 1 1 1 1 


TTTGGGGCCGGGCTGT 


40 


I 1 i 1 1 1 1 1 


■ 1 I 1 GACCATGTTTCTT 




i 1 i 1 1 1 1 1 


rTTTCTGTGCAGGAACC 




M M M M 


mTGGCCGGGCTGTTC 




1 1 1 M 1 M 
1 1 1 t 1 1 1 1 


TTTACATCCTGGAAGA 




1 i 1 1 1 ITT" 
1 1 1 1 1 1 1 1 


TTTCTCACGAGTCCTG 


45 


HLA-A 






M M M M 


1 1 1 M PAf^TPTHTr^AGT 




M M M M 


Mil A fSAPf^P AT ATfiAP 
1 1 1 MwMwwwM 1 M 1 V3Mw 




M 1 M M 1 


MM r^f^APf^PATATQAP 

1 1 1 VjOMwOwM 1 r\ 1 wMw 


50 


M M M M 


1 I 1 f^GTPGCCAGGTCC 


1 1 M i 1 1 1 


fTTTCCGGAGGCTCTCT 




M M M M 


1 i M CCTCCTCCACAT 




i 1 1 1 M M 


n ITCCGAACCCTCGTC 




1 1 1 1 1 1 1 1 


nTTATTTCTCGACATC 




1 1 1 1 1 1 1 1 


mTGGCGGACATGGCG 


55 


M M M 1 1 


m I CCAGAGCGAGGAC 




1 i 1 i 1 1 i 1 


Mill CACCACATCCG 




1 1 1 1 i 1 1 1 


mTGGGAGGCTGCGGA 




1 1 1 1 1 i 1 1 


[ i 1 1 1 GATGTGGAGGAG 
I I I I I I I I I I I I GGAGGAGGAACAG 
1 I I I I I I I I I I l AGTCATATGCGTC 
I I I I I H I I I I I GGTCTGCCGGAGC 
I I I I I I I I I I I lA AACCTGCCATGT 
5 I I I I I I I I I I I I CCGGGACACGGAA 
I I I I I I I I I I I I CGTCCTGGGGGGG 
I I I M I I I I I I I CCGCTGCCAGGTG 
I 1 I I I I I I I I I l ATGCGTCCTGGGG 
I I I I I I I I I I I l ATGCGTCTTGGGG 
10 I I I I I I I I 11 I I GGAGAAGAGATAC 
I I I I I I I I I I I I GGGAGCCCGCCCA 
I I I I I I I I I I I I CCGCAGGTTCTCT 
I I I I I I I H I I I GCGGAGGTCCTCT 
I I I 1 I I I I I I I I GGGCGGGCTCTCA 
15 I I I I I I I I I I I I GCAGGACACGGAG 
I H I I I I I I I I I GCGGCAGTGGAGA 
I I I H I I I I I I l AGGAGACAGGGAA 
I I I I I I I I I I I I GTCAATCTGTGAG 
I I I I I I I I I I I l A GAAGTGGGTGGC 
20 I I I I I I I I I I I I CAGGTAGGCTCTC 
I I I I I H I I I I I CGGACGCCGCCAA 
I I I I I I I I H I 1 I CAATCTGTGAGT 
I I I I I I I I I I I I I GAAGGCCCAGTC 
I I I I I I I I I I I I CGTCGTAAGCGTC 
25 I I I I I I I I 11 I l AACCAGAGCGAGG 
I I I I I I I I I I I I I GACGGTCATGGC 
I I I I I I I I I I I I I GGACCTGGCGAC 
I I I I I I I I I I I I GAGAGCCCGCCCA 
I I I I I I 1 I I I I I I CATATTCCGTGT 
30 I I I I H I I I I I I GGGAGACACGGAA 
I I I I I I I I I I I I GTCCACTCGGTCA 
1 I I I I I I I I I I I CCGTGTCTCCCCG 
I I I I 1 I I I I I I I GCTGCCACGTGGG 
I I I 1 I H I I I I I CGAACTGCGTGTC 
35 I I I I I I I I I I I I GGTAGGCTCTCAA 
I I I I I I I I I I I lAGGTCCACTCGGT 
I I I I I 1 1 I I I I I GTCCTGGGGGGGT 
I I I I I I I I I I I I GCTGCTCCGCCGC 
I I I 11 I I I I I I I GGGGCGCCATGAC 
40 I I I I I I I I I I I I GCGCGATCCGCAG 
I H I I I I 1 I 11 I GCACATGGCAGGT 
I I I I I I I I I I I l A GGAGAAGAGATA 
I I I I I I I I I I I l AGGAGCAGAGATA 
I I I I I I I I I I I I CCACTCCACGCAC 
45 I I I I I I I I I I I I CGCGTCCAGGCAC 
I I I I I I H I I I I CACGTGCCATCCA 
I I I I I H I I I I I CCCGGCCCGGCAG 
I I I I I I I I I I I I CACGTCGCAGCCA 
I I 11 I I H I I I l ACGTCGCAGCCAT 
50 I I I I I I I I I I I l ACGTGGCAGCCAT 
I I I I I I I I I I I l ATCCAGAGGATGT 
I I I I I I I I I I I I CGAGCTCCGTGTC 
I I U I I I I I I I lA CCAGAGCGAGGA 
I I I I I I I I I I I lA TGAACAGGAGGC 
55 I I I I I I I I I I I I I CACACCCTCCAG 
I I I I I I I M I I I CTACGTGGACAAC 
HLA-B 

I i i I I I I I U I I GGATGGCGCCCCG 
I I I I I I I I I I i i CGGCTCAGATCTC 
I t I I I I I I I 1 I 1 I CGGGGCGCCGTG 
5 i I I M I i I I I I I CTCCACTGCTCCG 
I I I 1 I I I i I M I I GTGTTGGTCTTG 
1 I i I I M I 1 I I I GGGTATGACGAGT 
I I I I } I I I I I I M CGAGGTGATGTA 
I M I I I I I I i I I GTGCTGCTCCGCC 

10 I I I I I i i M I I i I GTAGTAGCGGAG 
I I I i i I M I I 1 I GCTCAGGTGGTCC 
i M I I M I M I l AGGAAGACAGAGA 
I I I M I I I I I I I GCGTCGTAGGGGT 
M i I I I I I M 1 I GTGAGGGTGGGGA 

15 I I I I M 1 I I I M AGATGATGCAG AG 
I I I I I I I I 1 I i I GGTTGTGTGGGTA 
I I I I I i i I M I 1 i GATGTGTCTCTC 
i I I I I I M I I M GCGCCATGACCAG 
i M I I I I I I I I I GGCGTCCTGGTCA 

20 I i I I I I i I I I I 1 AGGAGGACGTGAG 
I i 1 i I I M 1 I I I GCGCCAGGCAGAG 
I I I I I I 1 M I I i AGGAGGGGGGGGA 
M i j i I I I I I I I GCGCTGCTGCGCC 
M i I 1 I M I I i lA GACCATCCAGAG 

25 I I I I M 1 H 1 I i GACACAGATCTAG 
[ I I M I I I I I I I GGGCATGAGCAGT 
I M I I M I i I I I GAGAGAGATCTCG 
I j i I i I I i I i I I GCGAGTGCGTGGA 
I I I 1 I I I I i I I I I GGTAGCCGGGGA 

30 I I I I 1 I I I I I I 1 GGTGTGCGTGGAG 
i I I I I I I I I I I lA GACACAGATCTT 
M I I 1 I i I M I i CAGCGACGGCAGG 
M 1 I I I I I I I II GGGGCCGGGACAC 
TTTTTTTTTTTTGCCGTGCCAATAC 

35 I I i 11 II i li M GGGCATAACCAGT 
i I I I I I I I I M I GCGCCGCTTCATC 
I I I M I I I I M I GAGGAGCGGAGGT 
I 1 I I I I M I I I i CGTCCACGCACAG 
1 I I i I 1 I I I I I I GAGTCCGAGAGAG 

40 I I i I i 11 It 1 M GACACAGATCTCC 
I I M I I I i I I I I [ AACCAGTTAGCC 
I I n I i I I I li I l AGGCGTGCTGGT 
I I j I i I I i I i 11 GACCCTGCTCCGC 
i 1 I I I I I I II II GGGGCTCCGCAGA 

45 I I I I I M M I I I CCGGTCCCAATAC 
I 11 I I I M l i I I GCGGGTCACGGCG 
j 11 i li i I I I i lA GGGCCAGGGCTC 

I I i l i I I li I I lA TCCTCTGGAGGG 

II I 1 1 I I i I M I GGCAGACGATGTA 
50 l i I I M i I I I I I AGGCGGAGCAGGA 

I I I 1 I I I 1 I I I I CAGCTGCTCCGCC 
I i I I I 1 I I I M l ATCTGCGGAGCCA 
I M I I II I I I I I CGGAGCTGTGGTC 
1 I i II I I i i I M CGACCACAGCTCC 
55 i I I i I I I i I I M GAAGAGTTCAGGT 
I I I I It 1 I I I i I CATGTCGCAGCCA 
M t I I I II I I I I CTGGGCTGGCTCC 
t I 1 1 I M I I i I I CAACACACAGACT 
I I I I n I I I H 1 I GGCGGAGCAGGA 
I I I I I I I I I I I I l ATGACCAGGACG 
I I I I I I I I I I I I CCACTGCTCCGCC 
I I I I I 1 I I I I I I A TGACCAGGACGG 
I I I I I I M I I I I GGAGGGGCCGGAG 

5 I I I I I I I I I I I I I GCGTGGACGGGC 
I I I I I I I I I I I l A GATCTGTATCTC 
I I I I I I I I I I I I GCGGGTCATGGCG 
I I I I I 1 I I I I I I CCGGGACATGGCG 
I I I I I I I I I H I GCACAGCTGTCCA 

10 I I I I I I I I I I I I CGGGACATGGCGG 
I I I I I I I I I I M CCCGTCCACGCAC 
H I I 11 I I I I I I GAAGTGGGAGCCG 
I I I I 11 I I I I I I I I CCCAATCCAGG 
I I I I I I I I I I I I CGGACGATGGGGA 

15 I I 1 I I 1 1 III I I I I CGGAGTCGACC 
I I I I I I I I I I I I GAGATCTGAGGCG 
I I I I I I I I I H I ICGACGGAGTGGG 
I I I II I I I I I I I GAGAGCGACGGCA 

I I I I I M I I II I GGCGGCGGAGACC 
20 1 I II I I I I I I I I GTAGGAGGAAGAG 

II I I I I I II II I CTTTTGGACGTGA 

I I 1 I I I II I I I I CACGTCGGAGCCA 
I I I I I I II I I I I CAGGTCGCAGCCA 
H I I I I I 1 I I II GGTAGCGCAGTGC 
25 1 I I I I I I I I H l ATCGAGGTGATGT 
I I I I I I ill I I I I CGGAATGGACGG 

I I I 11 I I I I I I I GGGCGGTTGCTCC 

I I I I I I I II II I GGCGGTTGATGGG 

I I I I I I I I II I I CCGCGGTTGATCG 
30 I I I II I II I I I I CACACAGACTTAG 

H I II I I I I I I l AGGACGGTTGGGG 

II H I I I I I I II CGCGGAAGCGTCC 
1 I I I I I I I I I I I GAGGTCTTGGTGG 

I I I I I 1 I I I I 11 GCTGCCGAGAGCA 
35 II I I I I I I 1 I I l ACTCCATGAGGCA 
I I I I II I I I I I I GCTGTGGTGGTGC 
I I I I I I H I I I I I I GTCCAGAAGGC 

I I I I II 1 1 I I I I I GGCCGCGGAGGA 

II 1 I I I I I I l i I GCCGCGGACAAGG 
40 I I I I I I I I I I I I CCGCGTTGTCGGC 

I I I 1 1 1 I I I I I I CGGGTAGCACCAG 

HLA-C 

I I I I 11 I I I I 1 I I GAGCTGGGAGGC 
I I I I I I I I I I I I GGTGCAGGGGTGC 

45 I I I I I I I I I I I I GGGTGCAGGGCTG 
I I I I I I I I I I I I GAGGCGGAGCAGC 
I I I I 1 I I I I I I lA CGGCGGAGCAGC 
I I I I I I I I I I I I GCGGCGGAGGAGG 
I I I I li I I I I I lA GCGCGCGGAAGC 

50 I I I I I i I I I I I I CGGCGCAGGTCTC 
I I I I I I I I I I H I GGCTCGGAGCTG 
I I I I I I I I I I I I GCGGGCGGAAGCG 
I I I I I I I I I I I lA GGGCTTCCATCT 
I I I I I I 1 I I I I I GGTTGGGGGCTCC 

55 1 I II I I I I 1 I I l A CTCCAGGCACAG 
I I I I I I I I li I I I GGAGCAGGAGGG 
I I I I I I I I I I I I GGGCGCAGAACCC 
I I I I I I I I I I 1 I I GAGTCTCTCATC 
I I I I I I I I I I I I GCTGGAGGCGCTC 
I I I I I M 1 1 { I I CCGCCGTGTCCGC 
I I I I I I I I I I I I CCGCTGTGTCCGG 
I I 1 I 1 i I I I I I I t CCAGAATATGTA 
I I I M I i I I I i I CGGGGAGCCCGGC 
5 I M I 1 I I I 1 I I I GCCGTCGTAGGCG 
H M I I I I I M I CGGGCAGGCAGAG 
1 M I 1 I I I I 1 I I GCGGCAGGCAGAG 
1 i I I I I I 1 I M i GTAGGCGCGGAGG 
1 I I i M M M I I GCTGGACGGAGCG 

10 I 1 i [ i I I I I I M IGCAGTGGATGTA 
1 I M I I I 1 I I I I I CCACGGACAGGC 
I I I I I I I I i I I I GGGGTGTGGGCAG 
TTTTTTTTTTTTGAGGGGAGGGGCG 
I M I I I I M I I I CGTGTCCCGGCCT 

15 I i I M I i I i M I GGCATGACCAGTT 
I I I I I i I i M M GGTATGACGAGTT 
I I I i I I I I I I I I GACAACCAGGACA 
I I I I I M 1 I 1 I I GAATATGTATGGC 
I 1 I I i I M I I I I GACAGGCAGGACA 

20 I I I 1 I I H I M I CTGGCTGTCCTGG 
I M I M I M I I I CTCCTAGGACAGC 
i I I i I i 1 i I I i lA GGGCCAGGGCTC 
I I I I I I H i i I I lA TAACCAGTTCG 
1 I I 1 I I I I I I i I CATAGGAGGAAGA 

25 I I I I M M M I I I GTGGAGACCAGG 
I I I M I I I I I I I I GCTCTTCTCCAG 
I I I I I i M M I I GAAGAATGGGAAG 
I I M i I I I I M I I GCGGAAACTGCG 

16S rRNA 

30 I 1 I I I I I I I I M l AGCCGCCTGCGT 
M I M I I I I I I I GGCCGCAAGGCTG 
I I i I M I I M I I GAACTGCCGTTGA 
i I I i i I 1 I I I I lA GACTGCCGCTGA 
I I i I I I i i 1 M i I iA TTCGGAATTA 

35 I M H I I I 1 I M I iGCACCCCTTGT 
I M I I I i i I I I I CGCGAGGTTGAGC 
I M 1 I I I M I I I I ACCCCCCATTGT 
I I I I I I I I I I I I CATTTGATACTGG 
i I I I I M 1 I 1 1 I GTGTGCCTAATAC 

40 I i i 1 i I I I M U lA CGACTTAACCC 
I I I I I M I I I I I CCCGGCCTTTGTA 
TTTTTTTTTTTTGGGCAAACTGGAG 
H I M I I I I I I I GATTTGATCCTGG 
I I I I I M I I I M I GACTCCCGAAGG 

45 I 1 I I i I 1 I I M I GAAGTCGTAGCAA 
I I i I I I I 1 I i I I CGCTGCAGAGATG 
M I I i I I I I I I I lA CCCTACCTACT 
1 I I I I M i I I I I GAGGACCTTCGGG 
M I 1 I 1 1 I I 1 1 l AAGGGCCATTACC 

50 } I 1 I I M I I I I I GATAAACGCTGGC 
I 1 I I M I 1 I i I I GACTAGCTACTCC 
M i i i I i I I I 1 lA CATCCGGTGTTA 
I I i I I I I i i M l A TCGCAGGCCTTG 

I I I 1 I 1 I I I 1 I I I CACCAAGTCGCT 
55 M I I i I M I I M I CCCTCCTTTCGG 

I I i I M I 1 I H I 1 I l AAACGCTGGC 

I I i M { I I N I I CGAAACCGCAAGG 
I I I i i I I I I M I GCAAGCGTCCTCC 
I i I I I I I 1 I I i lA CCAAGGACGTTT 
i M I I M I I M i CTAATACCCGGAG 
I I I M I I I I I I l ACTTTCAGTGGGG 
■ I I 1 I I M I I I I i CTGCGTGAAGTCG 
I i I [ M I I 1 i I l AATAGCCCACCAA 
5 i I i I I M I 1 I I lA ACGGAAACGGGG 
I I I i I I I I I I I I GGATTGCACTCTG 
1 I M I 1 1 I I I I i i AGCCTTGGGGAG 
I I I I I I I M 1 I I CGCGGCATGGCTG 
M I M I M I M I GCATAAGGGGCAT 

10 i 1 I I I M M 1 I I lA CCACATCTCTG 
I I I i i I I I I I I i GTTACCGCGAGGA 
I M I I I 1 I I I I I GGCTTTCAGAGAT 
I I I I I 1 I i I I I I CGCTGCTTCGCTG 
I U I i I I I I 1 M I AGCGCTACCTTG 

15 I ! I 1 I I I 1 I I I I GCACCACCTGTCA 

I I I M I i I I 11 M GAGTnTAACCT 

i i 11 li I M 1 II CTAATAGGGGATA 

II 11 M I I I I I l AGGAGAAAGCTTG 
M I I I M M II I 1 I A AGAGATTAGC 

20 M 1 I 1 II i 1 M I GTAGCATTCTGAT 

I I 1 1 1 1 i 1 I I 1 1 AGGCTTTCCCCCA 
TTTTTTTTTTTTAGAAGTAGCTTGC 

I I 1 M 1 I I I I M r CGCGTATCATCG 
I I I I I I 1 It II I II CAGAGATTAGG 

25 i I I I I I I I I I II I CCGAAAGCGTGG 
1 1 1 I I I I 1 I I M l ACAACCCGAAGC 
I II I M i 1 I I I 1 I GTCATGGCTCAG 
1 M I I 1 1 1 I I M CGTAGGCTTGGTG 
1 M I M i I I I I I GTGGAATTCCACG 

30 I M i I I I II I I lA CGGTTCCCGAAG 
I I II I I M I II lA ACTGGAGTGCGT 

I 1 1 II I I M I I I I GATGTGCTATTA 

I I 1 I 1 I I M I I 1 AAGCAGGGAGGAA 
i M I 1 I 1 1 1 1 I 1 CTGCTGGAGTGAA 

35 I 1 1 1 1 I 1 1 1 I 1 1 I I GGGATTAGCTC 
T I I I I M I I I I I CCTTTGATACTGG 
I I M I I I I I I I I GGACGCTAGCGGC 
1 1 I 1 M i I 1 I I 1 GTTTACTACCCAC 
I N 1 1 I 1 M 1 1 I CGCGATCTCTAGC 

40 M 1 li I 1 1 1 1 M l AGGCCGTTCCCC 
I 1 I 1 1 I I I i I I lACGCGTTGGATCG 
VI I I 1 I 1 I 1 I 1 I GCCCGTCAAGCCA 
I M M I I I I I I lA GTCCCCGCCATT 
I I I I I M I I I I I CTAGCCGTAAGGG 

45 I 1 I I I 1 1 M I 1 I 1 GTCCTTCGGGGG 
I M 1 I 1 M 11 I 1 AACCAACTCCCAT 
T M I I 1 1 I 1 I I I ACTGTGGGTAATA 
M I M I I I I I I I CTGAAAGATGGCG 
I I I I I I I I I I I I CGAAAGCCAGGGG 

50 i I M I I I I I I II GTCCGGAATTCTG 
I M 1 I I I I I I I I CAGAAGTGGGTAG 

I 1 1 1 1 i I 1 1 1 I li CAGTCCTCATGG 
TTTTTTTTTTTTGAAAGAAGCTTGC 
M 1 1 I 1 11 i 1 1 1 GACCACCTGTCAC 

55 I I I I I i i I 1 M I I I I GGAACTGCAT 

I I I I I I I I I I I lA CAGTTCCGGAAG 

I I i i M I i M 1 I CTCATATCTCTAC 

I I 1 1 I II 1 1 1 1 I TTCAGTGAGGAAG 
I I 1 M i I I 1 1 1 l ACTGTGAGGAAGG 

60 1 I II I I I i I I I I CCCAGCCCGTAAG 
I I I I I I M I I I I CGTAGCCTTGGTG 
i I 1 I 1 I I I I i I l ATGATGCGTAGCC 
I I I M I I I I I I I AGGCAGTGGCTCA 
i I I I 1 I I 1 I M I CAGGACTTAACCC 

5 I 1 I i I I I I I I i I GGCCAGGCCGTAA 
i I I I i I j I I M I I CCAACTTCGTGC 
I I I I I 1 I I I I I I GAAGCGTGTGTGA 
j I I I M I I I I I I CTCCCCCGAAGGT 
I I M I M I I I I l ATGGGAGTTTGTT 

10 i M I I i i I I I I I GTGTGCCGTTACC 
I I I I i I I 1 I I 1 l A GCAGTGAGGAAT 
I M I I I I I 1 I I i GCCCCGGTTAACT 
M I I I I I I I I 1 I GCAGCGGGAGTCA 
I 1 H I I I I I I I I GGACCTTCCTCTC 

15 i I I I i I I I M I I ACCTAGGTGGGAT 
M 1 1 M I I 1 M l AATAGCTAATACC 
I I I M I I ill M GCCATATCTCTAC 
1 M I I 1 i I i I I I GCCGGTGGGGTAA 
I I I i i I I I I I I I l ACCCCACCTTCG 

20 1 I I I M I I I I I I CAAGGCCTGGGAA 
I I I I 1 I I M I I I CAACCCTGGTGGC 
I I I I M I I I I I I CTAGTCATCCAGT 
I I I I i I 1 I M M GGCTGCTGCCTCC 
I U i I I i I 1 i I 1 CCGAGAGGTCAAC 

25 M I I M I I I i I I GAAAGGTTGATGG 
I I I i I 1 I I I I 1 l AACAGGGTGGGAA 
I I I M I I I I I I i GAGCTTGGTGGGG 
I I 1 I I I I I I H l ATTTAGTTGAGCA 
I I I I I 1 I I I I M GGAGTTAGGCTCA 

30 I i i I 1 i I I I M I I I GATGTGGTATT 
I I I I I I I i I I I I GTTAGGTGGGAGG 
I 1 i I I i i M M I GGCTAGAGATGGT 
I I I M I I I I I I lAACTTGCGTGGAT 
I 1 i I M 1 I i 1 I I GCGATTACGTGAA 

35 M I I H I I 1 i I I GGACGTTGGGGGC 
I I 11 I I I i I I I I I GGTGGAGGATGT 
li 1 1 H I M II lA TAAAGGATGGGG 
1 1 I I I I II I I I lA AGAAGTGGGTAG 

I U I I i I II M lA AGAAGGTAATGC 
40 I I I 1 1 I I I I I I II CGATGGTTTGAG 

- I M I M 11 II I l AGTAAGTGGCGGT 

I I I M M li 1 II CAAAAGGGGGGGT 
1 1 11 1 I 1 1 I I II GGCGGTTGGGGTG 
i M I II I I I H I GGTACGTAGGTGG 

45 I I I I I I I I I I II I GCGAGGTGGAGC 
M 1 I I 1 I 1 I I I I CGCGAGGTGGAGG 
I I I I 1 I I I M I I GGTACCTACTTCT 
1 1 I I M I I I M I I lA AGACATAGAA 
I I I I I I I I M I I IGTTGTGAAATGT 

50 I II I I II I I n I GGTAAAAGTGAAA 
I I I I I I I I I I II IGAAGGGGGAAGT 
1 1 I I I M 1 I I I I I CCAACCTTGCGG 
I II I 1 I M I I I I GGAGGAACGTGGG 
I I II I I I M I I l ATAAGCCTGTCAG 

55 n I I U M I I I I lA TGCTAATGGCA 
I I I I I I I I I I I I GATGCTAATCCGA 
I I I 1 I I I I I I I I GCCAGTGTTGGTG 
I II I 11 1 I I I M GTAAAGGTGGGGA 
M I I I I 1 I II M M AACACACCGCC 

60 I I I I I I M I I I i CCAAGGCGGTGAT 
I I I U I I I I I I I GCTACGGCTAACT 
I I 1 I I I I I I I I l AGTCGAGCACTCT 
I I I I I I I I 1 1 I lA AGGGTAGCTAAT 
I I I I I I I M I I I GTCAGAGTACGAG 
5 I I I I I I I I I I I I I GAAAGCACTTTA 
I I I I I I I I I I I I GGGGCAAGGCTTA 
I I I I I I I I I I I I GCCTAGGTGGGAT 
1 I I I I I I I I I I I GTCCCCACGTTCC 
I I I I I I I I I I I IGGCCACAAGGGGA 

10 I I I I I I M I I I I CTAGCTGTAGGGA 
I I I I I I I I I I I I GTGGGCAGCAAGG 
I I I I I I I I I I I I I CGAAAGATTAAA 
I I I I I I I I I I I I GGAGTATGGTCGC 
I I I I I I I I I I I I CGAGATGTGAAAG 

15 I I I I I I I I I I I I GGGCAGGCTAGAG 
I I I I I I I I I I I l ACCTCCTGAGCCA 
I I I I I I I I I I I I I CCACCGGTACAC 
I I I I I I I I I I I I I I I CAGTCTTGCG 
I I n I I I I I I I I CTTGACGGGCGGT 

20 I I I I I I I I I I I l ACGGTAAAAGATG 
I I I I I I I I I I I I I 1 CACCCTTGCGG 
I I I 1 I I I I I I I I lA ACCAGAAAGCC 
I I I I I M I I I I I CAACCAGAAAGCC 
1 I I I I I I I I I I I GTGTCAAAGGCAG 

25 I I I I I I I I I I I I lA AGTCCGGATTG 
I I I I I I I I I I I I GCGACATGCTGAT 
1 I I I I I I I I I I lA TCAGCCTGCCGC 
I I I I 11 I I I I I I GTCGGTAGGGTAA 
I I I I I I I I I 11 I GTCGGTGGGGTAA 

30 I I I I 11 I I I I I I CAACTCATAAGGG 
I I I I I I I I I I I I I I CACTGCTTAAA 
I I I I I I I I I 1 I I CGCCAGTCCCACC 
I I I I I I I I I I I I CTAGTCATAAGGG 
I i I I I I I I I I I I CACTGATTTGACG 

35 I I I I I I I 11 I I I GGCCACAGAGGGA 
I I I I I I I I I I I I I I IGCCCCATTGT 
I I I I I I I I I I I I I GACCAGAAAGGG 
I I I I I I I I I I I l ACACTGGGGGATA 
I I I I I I I i I 1 I I I CAGCCGCCTTCG 

40 11 I I I I I I I I I I GTCGCCAGCTCGT 
I I I I I I I I I I I I CTCATATGAATTG 
I I I I I I I I I I I I I GTAAAGGGAGCG 
I I I I 1 I I I I I I I CGTAAAGGGAGCG 
I I I I I I I I I I I I GGCGGCTCCCTCC 

45 11 I I I I I I I I I I CAGATGTTCCTCC 
I I I I I I I I I I I I GTCTCACGACACG 
I H I I I I I I I I I I CAGCCGGCTACG 
I I I I I I I I I I I I 11 GTGGTAATACG 
H I I 11 I I I I I I CTTGGAACTGCAT 

50 I I I I I I I I I I 1 lAGTACTCACCGGT 
I I I I I I I I I I I lA TTGCTCCATCAG 
I I I I I I I I I I I I I GATCCTGAGCCA 
I I I I I I I I I I I lA GGAAGTAGAACG 
I I I I I I I I I I 1 1 I GCAAGTAGAAGG 
55 I I I I I I I I 1 I I I GATAACC GCAA GG 
I I I I I I I I I I I I GCAAGCGTTTTCC 
I I I I H I I I I I I GAATACCTCCTTT 
I I I I I I I I 1 I I l A CAGAGCTTTACA 
I I I I I I I I I I I I I GTCCTTCGGGAG 
60 I I I I I I I I I I I l AGGCGGCTTGCTG 
Heterozygotes 

(From CFT if available, otherwise greedy algorithm). 

DPA1 

TTGCCCAGGGCACAG 
TTCTGTTGTTCTATG 
TTAAGGAAAAGGCTC 
TTATGAAGATGAGCA 
TTCACCCTCAGTGAC 
TTGTCAACTTATGCC 
TTGCAGGAAGAGGCT 
TTTTTGTACAGACGC 
TTCGGTCTCCTTCTT 
TTGCAATGGGGAGCC 
TTTGGATCTGGATAA 
TTTGATGAAGATGAG 
TTTGTTTGTAGAGAG 
GTACAGAC 



TTCGTT 
TTCTCAGGCCGCCAA 
TTCTCAGGCCACCAA 
TTATGTGGATCTGGA 
TTACACTCAGGCCGC 
TTCACACTCAGGCCG 
TTTCAGGGCACCAAC 
TTCGTGTGTACAAAC 
TTAGAACATGTGATG 
TTAGAACTGCTCATC 



TTTTGAATTTGATGA 
TTTTGAGTTTGATGA 



DPB1 

TCAACCGGGAGGAG 

TCAACCTGGAGGAG 

TCATCCTGGAGGAG 

TTGCTGGGGGGTCA 

TGGCCTGACGAGGA 

AACTAGGAGGTGG 

TTCCAGAGAATTAG 

TTGGCGTAAGTGGT 

TTCCAGTACTCCTC 

TAGTGCCGGACAGG 

TACCCCCCAGCAGG 

TAGAGAATTACGTG 

TTCCAGTACTCCGC 

TGCATTCCTGCCGT 

TCGGGAGGAGCTCG 



TCAGCCAGAAGGAC 

TATTGCCGGACAGG 

TCTGCAGCGCCGAG 

TGCGCGTACTCCTC 

TACAGAATTACCTT 

TTTAAGTGTACCAG 

TATCCTGGAGGAGA 

TGGTCATGGGCCCG 

TGGGAGGAGTACGC 

TTGGGGCGGCiCTGA 



WO 00/65088    PCT/EP00/03636 



TAAAAGGTAATTCT 
TCTGCCGTAACTGG 
TTTGTGTCTGCATA 
TGGCTGTTCCAGTA 
TGTCCCTGGTACAC 
TCCTGCAGCGCCGA 
TTCTTGGAGGGGGA 
TGAGGTCCTTCTGG 
TCAACCGGCAGGAG 
TTGTGTCTGCATAC 
TCGGGAGC.AGTTCG 
TTGACCCTGC-AGCG 
TCAGAGAATTACCT 
TTGGGTAGAAATCC 
ACGTGCACCAG 



TT 

TCGCTGCAGGGTCA 
TAGCCAGAAGGACA 



TGTTCCAGTAGTCC 
TTTGGCCTGCTGCGGA 
TTGCAGCGCCGAGG 
TACTACGAGCTGGT 
TCTGGGGCGGCCTG 
TACAGCGACGTGGG 
TTGCCGGAGAGGAT 
TCTGCCGTCCCTGG 
TCATGGGCCCGACC 
TGTCCGATTAAACG 
TGTAACTGGTACAC 
TAAGGACCTCCTGG 
TCTCCTGGAGGAGA 
TGAGAATTACGTGT 
TCCTGATGAGGTGT 
TCACAGGAGGAGCA 
TTGCCGTCGCTGGT 
TGGGAGGAGTTCGC 
GGACAGGAGGAA 



TACCCTGCAGCGTC 
TCCGCCCGGAACTC 
TGCTGCAGGGTCAC 
TACAGGACTATCCA 
TGCGTACTCCTGGC 
TCCGTAACTGGTGC 
TGCAGGAATGCTAC 
TCCAGGCAGCATTC 
TAACCGGGAGGAG 
TTGGGCTC.AGGCGGA 
TACTACGAGCTGGG 
TATGAGGTGTACTG 



TATACATCTACAAC 

TTAACTGGTACACT 

TCACGTAATTCTCT 

TAGCATTCCTGCCG 

TACTGGTACACTTA 

TGGCAATGCCCGCT 

TGCTTCGTGCTGGG 

TCGCCCGGAACTCT 

TACAGGACTGTCCA 

TTCCTCCAGGAGGT 

TCCTTCTGGCTGTT 

TGTTCCAGTAGTCC 
TTGCGCTGCAGGGTC 
TTAACCTGGAGGAGA 



TTTTCCTGCCGTAAC 

TTACGCTGCAGGGTC 

TTCCACAGAATTACC 

TTCCAGAGAATTAGG 

TTCGCCGAGTCCAGC 

TTAACAGGCAGGAGT 

TTTCGTCCAGGATGT 

TTAACCGGCAGGAGT 

TTCTCCAGAGAATTA 

TTGTTCCAGTACACC 

TTCTCCTGTAGGAGA 

TTTTACCTTTTCCAG 



TTGGAGGAGTTCGTG 
TTGAGGAGCTCGTGC 
TTGCCGTAACTGGTG 
TTGCCCGCTCCTCCT 
TTCGTGCCTGGAAAA 
TTGCCGTCCCTGGAA 
TTCCCCTCCAAGAAG 



TTGCTGCCTGGGTAG 

TTTCCAGTAGTCCTC 

TTATTCCTGCCGTAA 

TTCCTGGAAAAGGTA

TTCGTCCCTGGTACA 

TTCTCCTCCAGGAAG 

TTTCTGATTCTGCCC 

TTATCTGGCTGCTGG 

TTGAAGGACAACCTG 

TTCGTGCACCAGTTA 



TTCGGACAGGGTATG 

TTCGGACAGGATATG 

TTGCACTCGGGGCTG 

TTACACGTAATTCTG 

TTGGTAACTGGTACA 

TTAATGAGGCCCCAG 

TTTCTCTCCAGGAAG 

TTCAGCGACGTGGGA 

TTTCCTGCCGGTTGT 

TTGAAGGACATCCTG 

TTGAAGGAGCTCCTG 

TTTGTTCCAGTAGAC 

TTCAGAAGGACAACC 



TTGCCTGATGAGGTG 



DQA1 



TGACAAGAGGGAAG 

TCATAAGAGGGAAC 

TGAACACAGGCAAC

TACATCCTCATCTG 

TGAGTGCCCATTGC 

TCAGCCACAATGTC 

TACAATGCCAGGGC 

TAGAACCCCAGGGG 

TGTGGGGATTGTGG 

TATGGGGATTGTGG 

TCCAACACCCTCAT 

TAGACTGTGGTCTG 
TTI






TTCCAACATCCTCAT 

TTGGCCCACAGACAA 

TTCATGGGCATTGTG

TTAAGATCGTCATCT 

TTCAACACCCTCATT 

TTGACTGTGGTCTGC 

TTAGCACTGGGGACT 

TTCTTAGATTTGACC 



TTTTTAGATTTGACC
TTCGATGTTCAAGTT 
TTCAATCCCAGGGCG 
TTCCTCGGATGATGA 
TTTCCACATAGAACT 
TTAAATTCATGGGTG 
TTCAGCCACAATGCC 
TTCACCATAAGAGGC 



TTTTCCTCCCTTCTG 
TTAACTCTCCTGAG 



TTTAAATCTCATCAG 

TTCTCCTCCCTTCTG 

TTGTCAGCCAGAATG 

TTTCATTCCTTCTTC 

TTCTTCCTCCCTTCT 

TTATAACTCTCCTGA 

TTGAGGCTCATCCAG 

TTCAGGCTTGTCCAG 

TTATGTTGACCACAG 

TTAGTGCCCACCACA 

TTGAACATCCTGATT 

TTGGACCTGGAGAAG 

TTCCCTGTGGGCAGT 

TTGCCTCTGGGR-AGT 

TTTTACACCGTAAGA 

TTAGAAGATTTGACC 

TTGAACTGGCCAGAG 

TTGCTACAACTCTAC 



TTCAGTCTTACGGTC 
TTCAGTCTTATGGTC 






DQB1 






TATCTTGCAGAGGA 

TGGCTGGGGTGCTC 

TGGGTCACCGCCCG 

TCTGGGGCCGCCTG 

TCTCGGCGCTAGGC 

TGTATGTGGTCACA

TAAGTACGAGGTGG 

TCCAGTACTCGGCG 

TCGGTTATAGATGT 

TGCAAGTCCTGGAG 

TTGGACACAACGCC 

TCTGGGGCTGCCTG 

TGGCCTTAAACTGG 



TTGTGTCTGCATAC 

TGTCGGAAAGGGCT 

TGGGTGTATCGGGT 

TGCAGTACTCGGCA 

TGTAGACATCTCCA 

TAGGAAACGGGCGG 
TCACACCCCGCACG
TTCCGCTCGGGTCC 






TAGCATCACCAGGA 

TCCAGTTTAAGGGC 

TATAGCCACAAGGA 

TGTATGCAGACACA 

TTCCAGTACTCGGC 

TAGCGCACGATCTC 

TGGACATCCTGGAG 

TTGGGGCTGCCTGA 

TGTCAGAAAGGGCT 

TCAGGAGCCCTTTC 

TTGTCTCTTCCTGG 

TACACCCCGCACGC 

TTGGTTTCGGAATG 

TAACGGGACAGAGC 

TGCTGGGGCCGCCT 



TGAGGATTTCGTGT 

TGAGAGGAGTACGC 

TCACATCAAAGTCC 

TGCCAGGAGGAGAC 

TGTACTCGGCGGCA 

I I 1 t 1 I I I GGCCAGTTGTCTG 

TAGGGGGGTGGACA 

TAGATGTATCTGGT 

TTGGGGGAGTTCCG 

TTGTGTGGTCCTGG 

TGACACTCTGTCCA 

TGGAATGATGAGGA 

TATGGGGTCGCCGC 



TGAGATCAAAGTCC 
"TTT TT I I I AACGGGACCGAGC 
TAGGAGTACGTGCG 
TATGTGAGGAGATA 
TAGGGGGGGCCTGT 
TGGCGGGTTGTCTC 
TTGTAAGGAGAGAG 
TGTGAAGTAGCACA 
TAGCGGCGACCCCA 
TCACACCCTGTCCA 
TGTGTGAGGAGATA 
TTGGAGGTTCGAGA 
TATGGGGTGGTGAC 
TGTTTAAGGGCCTG 
TTGAAGTAGCACAG 
TGCTCCAACTGGTA 
TCCTTAAACTGGTA 
TAGGAGGACGTGCG 
TTGGTGCTGGGGCT 
TCGGTGCTGGGGCT 
TGGAAGGAAGATCA 



TACCGCGCGGTGAC 

TGCCCTTAAACTGG 

TTGGTCACACCCCG 

TGGGAGTTGCGGGC 

TAGGAGGAGAGAAC 

TGGGTGGACACAAC 

TTCTGCTCGGTGAC 

TTGGGGCGGCTTGA 

TGCGCACGTCCTCC 
TTTTTTI MM I AGGATTTCGTGTA
TTTTTT M M M GCCTTAAACTGGA 



DRB345 



TGTACCTGGACAGA 

TGTTCCTGGAGAGA 

TACACTCATACTTA 

TACACTCAGACTTA 

TTCCTGGAGCAGGC 

TTCGAAGCGCGCGT 

TAATCTGCACAGAG 

TAGGGCCCGGCTGT 

TAGGACACTCTGGA 

TGTGTAAACCTCTC 

TCTGTCGAAGCGCA 

TGGGGCCGGGCTGT 

TTCTTCCAGGATGT 

TAACTACGGAGTTG 

TCAAGAAACATGGT 

TTAACCAGGAGGAG 



TTGAAGCTCTCCAC 

TGGGGCGGCCTGTC 

TGCGGCGCGCGTGT 



TTTTCTTGGAGCTG 

TTTCTCTTCCTGGC 

TAACTACGGGGTTG 

TGTATCTGATCAGG 

TGGCGAGGTGGAGA

TGCCGCAGCTCCGT 

TGGTTCCTGGAGAG 

TGTCGAAGCGCACG 

TGTGTCTGCAGTAG 

TGCTCCACTTGGCA 

TTACGGGGTTGGTG 

TCGGTTCCTGCACA 

TTCCAGTACTCGGC 

TTGTCCACCTCGGG 

TTCTTCGTGGCCGT 



TGGTGTCCACCAGG 

TACTCCGTAGTTGT 

TGAGTGAGAGTTAC 

TGATGCTAGAAACA 

TGTGGAATGGAGAG 

TTAAGCAAGAGGAG 

TGTTCCGGAATGGC 

TGTATCTGCAGTAG 

TACGTGCTGGTCTG 

TAGGCAAGAGGACT 

TGCGGTTCCTGCAG 

TCGGGCCGCGGTGG 

TGTAAACCTCTCCA 

TCTGATCAGGCTCC 

TTCCAGGACTCGGC 

TAACCATTCACAGA 

TCGGGCCCTGGTGG 

TGTTGCGGAACGGC 

rGGGGGCCGGGTGT 



TTCCTGGAAGACAC 
TGCGGGGTGGACAA 
TTCTGCTCCAGGATG

TTCAACTACTGCAGA 

TTGTACCTGGAGAGA 

TTACCTCTGCACTCC 

TTGTGAAGCTCTCCA 

TTCCGCGGCGCGCGT 

TTCTGATCAGGTTCC 

TTAATGGGACGGAGC 

TTTATGGAAGTATCT 

TTTCTGGAGTAGGTG 

TTCGGGCCGCGGTGG 

TTCTGTGCAGGAACC 

TTCGAAGAGGAGGAC 



TTCAATTACTGCAGA 

TTCACCTACTGCAGA 

TTCTGCCTGGATAGA 

TTGTAATTGTCCACC 

TTCACCAGGGCCCGC 

TTTGCGGTACCTGGA 

TTCCTGCAGCACCAC 

TTGCGGCGGGCCTGT 

TTGGAGGACTCGGCA 

TTGACACAACTACGG 

TTGATACAACTACGG 

TTACTCAGACTTACA 

TTTGAGACTTACACA 

TTTACGGGGTTGTGG 



TTGTAGTTGTCCACC 

TTAACCAGGAGGAGT 

TTAACCAAGAGGAGT 

TTTCCACAGCCCCGT 

TTCAGCCAGAAGGAC 

TTGGAGGAGTTCCTG 

TTGAACTCCTCCTGG 

TTAACCACTCACAGA 

TTGGCCGGGCTGTTC 

TTCTCACGAGTCCTG 

TTGTCGAAGCGCAAG 

TTCCTCCTGGTCTGT 



HLA-A 



TTTCAGTCTGTGAGT 

TTCCGCAGGCTCTCT 

TTATGAGGTATTTCT 

TTGGACATGGAGGTG 

TTCAGGTAGGCTCTC



TTTAGTCTTGGGGGC 

TTGGTCGCCAGGTCC 

TTGGGAGCCCGCCCA 

TTCCGCTGCTCCGCC 

TTTGAAGGCCCAGTC 

TTGGAGCCATACATC

TTCCACTCGACGCAC 

TTCAGGTCGCAGCCA 

TTGGTCTGCGGGAGC 

TTGAGGTAGACTCTC 

TTGGGAGACACGGAA 

TTCCCGTCCACGCAC 

TTGTCCAGTCGGTCA 



TATCCAGAGGATGT 

TCGCGATCCGCAGG 

TCGGGGACACGGAA 

TGGAGGAGGAACAG 

TAAGTGAAGGCCCA 

TGGGGCTTGGGGAG 

TCAGACTAAGGGAG 

TGTCCTGGGGGGGT 

TCGTCGTAAGCGTC 

TAGGTCCACTCGGT 

TGGTAGGCTCTCAA 

TGCGCGATCCGCAG 

TGTGTCCTGGGTCT 

TATCCAGATAATGT

TCCGTCGTAGGCGT 

TTCATATTCCGTGT 

TGGGACCCCCCCCA 

TGGCGCATGGACCG 

TGCTGCTCCGCCGC 

TAGCGCAGGTGCTC 

TCTACCTGGATGGC 

TGGTATTTCTTCAC 

TATATGAAGGGGGA 

TCGGTGTGTGCCCG 

TCCGGGAGTGGAGA 

TCGGACGGGGCCAA 

TGGGTGAGGCGGAG 

TAGGAGACAGGGAA 

TAGAGGGAGGAGGG 

TGCACATGGCAGGT 

TGAGGTGGTGGGCC 

TATGAACAGCACGC 

TCGCGGGCCGGCAG 

TGGAGGGTGAGAGT 

TGACGGTGATGGC 

TCCGTCGTAAGCGT 

TGAGTATTGGGACC 

TCTGGCCTGGTTCT 

rTACCTCATGGAGTG 

TAGCCGGCATGTCC 

TCACGTGGCATCCA 

TGGTGGGGAGGTTC 

TAGGAGAAGACATA 

TCTGCTGGTGGGCC 

TTGACCCAGACCAG 

TCGGGCGGAGCAGT 

TAGGTTCGCTCGGT 

TCATATGGGTCCTG 

TCGTCCTGGGGGGG 

TGCACGTGCGTGGA 

TGGTATTTCTACAC 

TAGGAGGAGAGATA 

TCCCGAACCCTCGT 

TGCCACATGGGCCG 

TAGCAGGAGGAGCC 

TATCCAGATGATGT 

TGGATGGGGAGCAC 

TGGACTGGCGCTTC

TAGCTTGTAAAGTG 

TGATAATGTATGGC 



nrCACACCCTCCAG 

TTCTACGTGGACAAC 

TTCGAGCGAACCTGG 

TTCGAGACAGCCTGC 

TTGGGCTACGTGGAC 

TTACCACCAGTACGC 

TTGAGGATGTATGGC 

TTGATCTGAGCCGCC 

TTGATCTGAGCTGCC 

TTGATGATGTATGGC 

TTATACCTGGAGAAC 

TTGATGTATGGGTGC 

TTTCCGCAGGTTCTC 

TTGAGCAGAGATAAA 

TTGGGCTGGGAAGAC 



TTGATGGGCAGGACT 

TTTCACTTTCCCTGT 

TTCCCACGATGTGGA 

TTAGTCATATGCGTT 

TTGGCGGACATGGCG 

TTGCTCCGCCTCACG 

TTCGTCGTAAGCGTT 

TTGATCATGTTTGGC

TTCACGGACGCCCCC 

TTGCTCCTCGTGGTC 

TTAGTGACGGAGTGG 

TTAGTGATATGTGTG 

TTGGTGTGAGGTGCG 

TTTGGGACTTGGGGT 



TTGGGGAGTGAGAGA 

TTGGGTGAGATGAGC

TTGGTGTTGGAGGGG 

TTGAGAGGGTGGGGA 

TTGGAAGAGAGGGAA 

TTGGGAAGAGAGGGA 

TTGGTAAGGGTGGTG 

TTGGGGGTGGGTGGA 

TTGGGGGATGGGGGG 

TTGGAGAGGGAGGAG 

TTCGGAAGGGGGGGG 

TTGGAGTGGGTGGAG 

TTGCGAAGGTGGGGA 



TTCGGGTAGCAGCGG 

TTTGAAGGGGGGGTG 

TTGGGGGGGGGTTGG 

TTTGTGGGTGAGGGG

TTGGGTGATGGGGGG 

TTGGATGGGGGTGGG 

TTAGCTGAGAGGAGG 

TTGTGGTAAGGGTGG 

TTGGGGGGGGGGGGA 

TTGGTGGGAATACTG 

TTGGTGGGAATACTG 

TTGTTGTGAGAGGAT 

TTTGGTGTGGATGGT 



TTTCGGAGTTGTGCT 

TTGGTGAGCGAGAGG 

TTTGAGAGGGGGGGG 

TTGAGTGGGTGGAGT 

TTTAGATGATGTGGA 



TTGATCCGCAGGTTC 

TTTAGAGCAGGAGAG 

TTCCTGGCAGCGGGA 

TTTCATGGAGTGAGA 

TTCCGGCCGCGGGAA 

TTCCAGGACACGGAG 

TTCCGGGACACGGAG 

TTGCAGCCACACATC 

TTGGATGGTGTGAGA 

TTAACATCATCTGGA 

TTTCCTCCTCCACAT 

TTTGGGCGGAGCAGT 

TTTGGAGGGGATGGA 

TTCGCAGGAAGCGCC 



TTGGCCGTCATGGCG 

TTATGCGTCCTGGGG 

TTATGCGTCTTGGGG 

TTTTTCCCTGTCTCC 

TTTCAGGGTGGCCTC 

TTGAGGAGGAACAGC 

TTGCGCAGGGTCGCC 

TTCAGCCAAACATCC 

TTACTTCTGGAAGGT 

TTTCCTCTGGACGGT 

TTGGAGAAGAGATAC 



TTATTCCGTGTCTCC 

TTTGAATCTGTGAGT 

TTGGGCCGTGGGGCG 

TTCGGGGGACATGGC 

TTTAGAAGCTGTGAG 

TTGGAACTGCGTGTC 

TTCGAGCTCCGTGTC 

TTACTCCACGCACGG 

TTCTAGGTGGAGGAC 



HLA-C 






TTGAGCTGGGAGGC 

TATGACAAGAGCCA 

TAGGCTCTGGGCTC 

TGGAGTGGGAGGAG 

TTCAGACCCTCCAG 

TAGTGCACGCAGAG 

TGCCGTGGTAGGGG 

TCGCGCAGAACGCC 

TAGTAGCCGGGGAG 

TGGAGCGGACAGCC

TCAGGTAGGGTGTG 

TGGTTGGGGGCTCC 



TGCCCCAAGCGCTC 

TGGGCATGAGGAGT 

TGCGGCTCGGCGGG 

TTCCAGTGGATGTA 

TGGGATGACCAGTT 

TCTCAGTGGGTCAG 

TGAAGCCGTGGTGG 

TTAGTTTCCGCAGG

TCAGGTCGCAGCCA 

TCACTGCGATGAAG 

TGGTATGACCAGTT 
TTACAGCCAGGCCAG

TTGAGGCGGAGCAGC 

TTTGGTTGTAGTAGC 

TTACCTGCGGAAACT 

TTCGGCCCAGGTCTC 

TTGCTGGACGGAGCC 

TTGAGGTTCGGCAGG 

TTCCGCCAGGCACAG 

TTCCTCCTACACATC 



TTACGGCGGAGCAGC 

TTAGCGCGCGGAACC 

TTTTCACTCGGTCAG 

TTACGCCGCGAGTCC 

TTTGGAGCAGGAGGG

TTGGGTATGACCAGT 

TTATACCTGGAGAAC 

TTGGGTTCGGGGCTC 

TTGAGGGCTAGGACA 

TTATCTGAGGGGCTG 

TTGGGGGAGAGGGGC 

TTGGTGGGGGTTGTA 



TTGGTGGGGAAAGTA 

TTAGGGTGTGGTTGC 

TTTGGGGGGGGGAAC 

TTATGATGTGAGACG 

TTGTCGGTGTGGTGG 

TTGTAGTAGGGGGGT 

TTAGGATGTGAGAGG 

TTGGTAGGGTGTCTG 

TTAGGGTGTTGTTGC 

TTGATAGGAGGAAGA 

TTGACAACGAGGAGA

TTGCGGGGGGGAGGC 

TTGGTGAGGGGCTGT 



TTGGAGGGGGTGGGA 

TTGGGTATAAGGAGT 

TTTGGAGAATATGTA 

TTGGGTGGAGGGGTG 

TTGGGGGGGAAGGGG 

TTTAGTAGGGGGGTA 

TTAGGTGGTGTGAGG 

TTAGGGGAGGAACTG 

TTGGGCAGGGTGACT 

TTGGTGTGAGAGCGG 

TTTGGAGGGCGGAAC 



TTAGGCGGGGGAGGG 

TTAGTGGAGGAACTG 

TTGGGGAGGAAGTGT 

TTGGTGGAGGGGTGG 

TTGCAGGAGGAGCAG

TTTGAGTGTGTGATC 

TTGCGCCGTGTGCGC 

TTTGGAGGGAGAGGC 

TTAGTGGGTGAGGGT 

TTGAGAGGATGGAGA

TTGAGAGCGTGGAGA 

TTGCAGGAGGATGAG 



TTGAGGGAGGAGAGC 
TTTGGTGGGTGGCGT 
TTTAGGGGGGAGCAG 
TTCTCACACCATCCA
TTTGCGGCGGAGCAG 






TTTCTGAGCCGCCGT 

TTGGCGGAGGAGCAG 

TTCCGCTGCGGACAC 

TTTATAACGAGTTCG 

TTCACATCCTCCAGA 

TTCCGTGTCCGCGGC 

TTCGTGGACGACACA 

TTCCGCTGTGTCCGC 

TTGAAGAATGGGAAG 






CLAIMS 

1. A method of identifying a set of extendible primers for use in
the identification, typing or classification of a nucleic acid of known
sequence having known polymorphisms, wherein:

i) all possible nucleotide sequences of a chosen length of the
nucleic acid are identified, together with their corresponding extendible
primers, and

ii) at least one extendible primer is removed from the set,
wherein the at least one primer removed identifies a segment of the nucleic
acid identified by at least one other primer.

2. The method of claim 1, wherein between steps i) and ii):

ia) potential extensions for each primer are identified with
respect to each nucleotide sequence, and

ib) for each extendible primer the identified potential extensions
are compared to determine which pairs of sequences can be discriminated
by the primer.
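
Steps i), ia) and ib) can be illustrated with a minimal Python sketch. It is not taken from the specification: the function names, the toy allele sequences and the simplifications (primers written in the same orientation as the template, exact matching, first match only, single-base extension) are all assumptions made for illustration.

```python
from itertools import combinations

def candidate_primers(alleles, k):
    """Step i): every distinct length-k window that still has a base after it."""
    return sorted({seq[i:i + k]
                   for seq in alleles.values()
                   for i in range(len(seq) - k)})

def extension(primer, seq):
    """Step ia): base following the first match of the primer, or None."""
    pos = seq.find(primer)
    nxt = pos + len(primer)
    return seq[nxt] if pos >= 0 and nxt < len(seq) else None

def discriminated_pairs(primer, alleles):
    """Step ib): pairs of alleles for which the primer extends differently."""
    ext = {name: extension(primer, seq) for name, seq in alleles.items()}
    return {(a, b)
            for a, b in combinations(sorted(alleles), 2)
            if ext[a] != ext[b]}

# Hypothetical toy alleles, not real HLA sequences.
alleles = {"A*01": "ACGTAC", "A*02": "ACGTTC", "A*03": "ACGAAC"}
print(candidate_primers(alleles, 3))
print(discriminated_pairs("ACG", alleles))
```

In this toy case the primer `ACG` extends with T on the first two alleles and with A on the third, so it separates the third allele from the other two but cannot tell the first two apart.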


3. The method of claim 1 or claim 2, wherein a matrix of primers 
and pairs of primer extensions is prepared in binary form and is subjected 
to analysis by a set covering problem (SCP) algorithm. 

4. The method of claim 3, wherein a greedy algorithm is used.
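
A greedy heuristic for the set covering problem can be sketched as follows. This is a generic textbook greedy, offered as an assumption about what such an algorithm might look like rather than as the procedure actually used; the name `greedy_cover` and the dictionary representation of the binary matrix (primer mapped to the set of sequence pairs it discriminates) are hypothetical.

```python
def greedy_cover(covers):
    """Pick primers until every coverable pair is discriminated.

    covers maps each primer to the set of pairs it discriminates,
    i.e. the columns holding a 1 in that primer's row of the matrix.
    """
    uncovered = set().union(*covers.values())
    chosen = []
    while uncovered:
        # Take the primer that discriminates the most still-uncovered pairs;
        # sorting first makes ties resolve alphabetically.
        best = max(sorted(covers), key=lambda p: len(covers[p] & uncovered))
        chosen.append(best)
        uncovered -= covers[best]
    return chosen

# Hypothetical primers P1..P4 and pairs numbered 1..4.
covers = {"P1": {1, 2}, "P2": {2, 3}, "P3": {1, 2, 3}, "P4": {4}}
print(greedy_cover(covers))  # ['P3', 'P4']
```

The greedy choice is fast but gives no guarantee of a minimum-size primer set, which is one reason a further heuristic or a redundancy check may still be worthwhile.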

5. The method of claim 3, wherein a CFT algorithm is used
which involves a Lagrangian relaxation heuristic.

6. The method of any one of claims 3 to 5, wherein a set of core
primers is selected as a base for analysis by the SCP algorithm.

7. The method of any one of claims 3 to 6, wherein the set of
extendible primers identified by the SCP algorithm is subjected to a
redundancy check.
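
One possible form of such a redundancy check is sketched below in Python. How the check is actually performed is not stated here, so this is only an assumption: a selected primer is dropped when every pair it discriminates is already discriminated by the other primers still in the set. The name `redundancy_check` and the example data are hypothetical.

```python
def redundancy_check(chosen, covers):
    """Drop selected primers whose pairs are all covered by the others."""
    kept = list(chosen)
    for p in list(chosen):
        # Pairs discriminated by the primers that would remain without p.
        others = set().union(*(covers[q] for q in kept if q != p))
        if covers[p] <= others:
            kept.remove(p)
    return kept

# A greedy pass on this data would select P1 first, then P2 and P3,
# after which P1 no longer contributes anything.
covers = {"P1": {1, 2, 3, 4}, "P2": {1, 2, 5}, "P3": {3, 4, 6}}
print(redundancy_check(["P1", "P2", "P3"], covers))  # ['P2', 'P3']
```

The result can depend on the order in which primers are examined, so a different order may remove a different redundant primer.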

8. A set of extendible primers, for use in the identification, typing
or classification of a nucleic acid of known sequences having known
polymorphisms, identified by the method of any one of claims 1 to 7.

9. The set of extendible primers of claim 8, in the form of an
array.

10. The set of extendible primers of claim 8 or claim 9, for use in
the identification, classification or typing of an organism, allele or gene
selected from class 1 HLA, class 2 HLA and 16S rRNA.

11. The set of extendible primers of any one of claims 8 to 10,
wherein the primers are arrayed on a surface of a support in such a way
that recognisable patterns are formed with different types or alleles.

12. A set of extendible primers, for use in the identification, typing
or classification of a human leucocyte antigen (HLA) gene as indicated, the
set comprising about the number of primers indicated and being capable of
distinguishing about the number of alleles indicated:

            HLA gene    Number of Alleles    Number of Primers
Class 1     HLA-A       91                   172
            HLA-B       200                  <1000
            HLA-C       47                   94
Class II    DPA-1       11                   26
            DPB-1       74                   130
            DQA-1       17                   130
            DQB-1       34                   84
            DRB-1       192                  <1000
            DRB345      35                   94

13. A set of extendible primers, for use in the identification, typing
or classification of 16S rRNA, wherein the set comprises about 210 primers
and is capable of distinguishing at least about 1207 different sequences.

14. The set of extendible primers of claim 12 or claim 13, wherein
the primers have variable segments substantially as set out in appendix 1
or appendix 2.

15. A method of identification, typing or classification of a nucleic
acid of known sequence having known polymorphisms, by the use of the
set of extendible primers as claimed in any one of claims 8 to 14, which
method comprises applying the nucleic acid or fragments thereof to the set
of extendible primers under hybridisation conditions, and effecting
template-directed chain extension of extendible primers that have formed
hybrids.

16. The method of claim 15, wherein the set of extendible primers
is provided in the form of an array, and template-directed chain extension is
effected using labelled chain-terminating nucleotide analogues.

17. The method of claim 16, wherein template-directed chain 
extension is effected using four different fluorescently-labelled chain 
terminating nucleotide analogues, and the results are analysed by total 
internal reflection fluorescence or confocal microscopy. 

18. The method of any one of claims 15 to 17, wherein the
nucleic acid is a PCR amplimer.

19. The method of any one of claims 15 to 18, wherein the
nucleic acid is HLA Class 1 or HLA Class 2 or 16S rRNA or a PCR
amplimer thereof.

20. The method of any one of claims 15 to 19, wherein a
dUTP/uracil-DNA-glycosylase system is used to break the nucleic acid into
fragments.

21. A kit for use in the identification, typing or characterisation of
a nucleic acid of known sequence having known polymorphisms,
comprising the set of extendible primers as claimed in any one of claims 8
to 14.

22. The kit of claim 21, comprising also a pair of primers for
effecting PCR amplification of the nucleic acid.

23. An array of sets of extendible primers as claimed in any one
of claims 8 to 14, for the simultaneous identification, typing or classification
of two or more different HLA genes.

24. A computer readable storage medium having a program
recorded thereon, wherein the program consists of instructional steps for
identifying a set of extendible primers for use in the identification, typing or
classification of a nucleic acid of known sequence having known
polymorphisms, the steps comprising:

i) identifying all possible nucleotide sequences of a chosen
length of the nucleic acid and their corresponding extendible primers, and

ii) removing at least one extendible primer from the set, wherein
the at least one primer removed identifies a segment of the nucleic acid
identified by at least one other primer.

25. Computer readable program implement consisting of
instructional steps for identifying a set of extendible primers for use in the
identification, typing or classification of a nucleic acid of known sequence
having known polymorphisms, the steps comprising:

i) identifying all possible nucleotide sequences of a chosen
length of the nucleic acid and their corresponding extendible primers, and

ii) removing at least one extendible primer from the set, wherein
the at least one primer removed identifies a segment of the nucleic acid
identified by at least one other primer.



